data migration with symfony & doctrine

Scene from "Apocalypse Now" by Francis Ford Coppola (1979)

Diving into 'outsourcing applications with symfony': migrating data

This post was inspired by Fizyk's comment on the outsourcing applications with symfony article. Fizyk suggested that it would be a good idea to wrap all data migration functionalities into a symfony task(s). Basically, migrating data between different versions of the same project is the topic I want to focus on in this article.

new version of a project, written from scratch

In the previous article, we discussed developing a new version of an old project from scratch (e.g. because the old one was too inflexible to extend). The first thing to do is design a new database structure that will hold all the data the present project holds (I assume the structure has to be different, because if it were good, no one would redevelop the entire project from scratch). Depending on your luck, this can be an easy task or a nightmare. [I have already redeveloped 2 projects with databases that were not even in first normal form... yes, there are projects like this. Check this out: one of these projects was made by a really bad developer who never used any version control system. He used no development environment (only production) and no local machine to test new features before adding them. He made all modifications on the production server, even if it took 2 weeks to finish them. And there was only FTP access to the server... Oh yes, there are projects like that... and you may be asked to improve such projects :)]

data migration

Now that we have two parallel database structures, we need to copy all data from the old one to the new one. Depending on the data (its size, structure, relations, constraints, etc.), I use one of the following:

pure Doctrine (Doctrine in - Doctrine out)
mixed (Doctrine in - SQL out)
pure SQL (SQL query)

For each data structure, I choose one of the above migration types very carefully (detailed descriptions below). If the project is big (and I assume this is the case), the development phase takes a lot of time. It's very important that you can run the entire migration job:

as fast as possible: this lets you run the migration very often, because it doesn't take much time. That gives you the chance to improve the scripts and test them carefully (each mistake made during the migration will be painful once the new project goes into production). And, finally, you can download the production (old project) database and migrate it to the new structure to compare the frontends of both projects - this helps to develop the business logic,
with a wrapped interface: click once - make yourself a cup of tea - come back in 10 minutes - click again - go out with your dog - come back in 15 minutes - the migration is done. Compare that with the alternative: imagine you have a project with 50 tables and you have to run 50 actions (e.g. by a click), one for each table - or call 50 symfony tasks, not to mention that you need to call them in the appropriate order (because the data is related in an RDBMS) and you have to watch out not to break that order. Of course, there have to be 50 small jobs deep inside the migration system, but your interface should do the work for you automatically (check the example below),
without repeating the same manual modifications: this is difficult to discuss in general, so I'll give an example. I had an action table which stored user activity. The old project database had a record stating that user X had replied to topic Y with post Z. And neither topic Y nor post Z existed in the database (again, if the project is so bad that it has to be redeveloped from scratch, such mistakes are very common). Of course, the migration system failed each time symfony tried to insert a record that was related to a non-existent record. What to do?
- bad solution: remove this action record from your local copy of the old database (you download the old project database from the production server, load it into your local SQL server and everything is 100% safe, but you have to repeat the fix after every fresh download)
- good solution: remove this record from the production database, because it points to non-existent data anyway.
Why? Because if such mistakes were made in the old project (data inconsistency), there can be quite a lot of such painful records in the action table, and fixing them by hand will make a single migration job really long. After you have done the whole migration 2 or 3 times, you will have had enough of this dumb work and won't want to do it any more. Again, this is only a simple example. Each situation will probably need a different solution.

There is one more important thing: no matter how complicated your migration job is, always create a detailed list of steps, describing what is done at each step, why the steps are in this order, and what has already been migrated and what is still waiting. With such a list, performing data migration is really easy and stress-free :)

Here is an example of the migration job order from a project I did recently (all .sql files are 'pure SQL' migrations):

where the action http://polonia/backend_dev.php/migrator/migrateAll calls the following (these are the only 'pure Doctrine' migrations):

MigrationManager::migrateConfig();
MigrationManager::migrateForbiddenPhrases();
MigrationManager::migrateQuotes();
MigrationManager::migrateArticles();
MigrationManager::migrateReadingCategories();
MigrationManager::migrateReadingTextbooks();
MigrationManager::migrateCountries();
MigrationManager::migrateRegions();
MigrationManager::migrateSubpages();
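
The "wrapped interface" idea can be sketched in plain PHP as a small runner that executes named steps in a fixed order and stops at the first failure. This is only an illustration, not the code behind the migrateAll action: the function runMigrationSteps() and the stand-in closures are hypothetical, and in a real project each step would call one of the MigrationManager methods listed above.

```php
<?php
// Hypothetical helper (not part of symfony or Doctrine): runs named migration
// steps in the given order and returns a log of what happened.
function runMigrationSteps(array $steps): array
{
    $log = [];
    foreach ($steps as $name => $step) {
        $start = microtime(true);
        try {
            $step();
        } catch (Throwable $e) {
            $log[] = sprintf('%s: FAILED (%s)', $name, $e->getMessage());
            break; // later steps depend on earlier ones, so stop here
        }
        $log[] = sprintf('%s: done in %.2fs', $name, microtime(true) - $start);
    }
    return $log;
}

// Stand-in closures; the third one simulates a broken step.
$order = [];
$log = runMigrationSteps([
    'countries' => function () use (&$order) { $order[] = 'countries'; },
    'regions'   => function () use (&$order) { $order[] = 'regions'; },
    'broken'    => function () { throw new RuntimeException('missing parent row'); },
    'subpages'  => function () use (&$order) { $order[] = 'subpages'; },
]);
echo implode("\n", $log), "\n";
```

Keeping the order in one array means the dependency order between tables is written down exactly once, which also doubles as the "detailed list of steps" recommended earlier.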

pure Doctrine migration

This is the only code example in the outsourcing applications with symfony article. And this is probably the only migration type that would be wrapped in a symfony task, as Fizyk suggested. This is really easy - just fetch data from the old system and save it in the new system.

used for small amounts of data

Below is another example of this data migration technique - private messages sent between users of a social website:

static public function migrateMessages()
  {
    // read all rows of the old table as plain arrays
    $wiadomosci = Doctrine_Query::create()
      ->from('Wiadomosci w')
      ->orderBy('w.wiad_id')
      ->fetchArray();
 
    // create and save one new Message record for each old row
    foreach ($wiadomosci as $row)
    {
      $message = new Message();
      $message->setId($row['wiad_id']);
      $message->setTitle($row['temat']);
      $message->setText($row['tresc']);
      $message->setFromId($row['wiad_od']);
      $message->setToId($row['wiad_do']);
      $message->setDisplayed($row['przeczytane']);
      // the old 't'/'f' flag becomes a boolean
      $message->setProvoke($row['zaczep'] == 't');
      // the old table has a single date column, used for both timestamps
      $message->setCreatedAt($row['kolumna_data']);
      $message->setUpdatedAt($row['kolumna_data']);
      $message->save();
    }
  }

pure SQL migration

You may ask why native SQL INSERT queries are written and executed directly instead of using Doctrine. The answer is simple: time. Doctrine is nice, but for bulk work it is really slow. The problem with pure Doctrine migration is that it takes a lot of time to load all the data with PHP/Doctrine and then insert it into the new structure row by row. The table from the example above held over 50'000 records. On my local machine, it took several minutes to finish the job. That's too much (data migration should be as fast as possible). The solution is simply to create an SQL script that does the same job many times faster:

INSERT INTO NEW_DB.message (
  id,
  title,
  text,
  from_id,
  to_id,
  displayed,
  provoke,
  created_at,
  updated_at
)
SELECT
  wiad_id,
  temat,
  tresc,
  wiad_od,
  wiad_do,
  przeczytane,
  IF(zaczep = 't', 1, 0),
  kolumna_data,
  kolumna_data
FROM OLD_DB.wiadomosci
ORDER BY wiad_id

Sometimes the case may be even worse. The action table, mentioned before, held over 400'000 records, so the PHP/Apache configuration would need really big limits (and the migration would take several hours to finish). The following is another SQL script, migrating the action data:

INSERT INTO NEW_DB.action (
  created_at,
  created_by,
  type_id_external,
  target_profile_id,
  forum_post_id,
  forum_topic_id
)
SELECT
  kolumna_data,
  u_id,
  CASE akcja
    WHEN 'profil' THEN 1
    WHEN 'logowanie' THEN 2
    WHEN 'wylogowanie' THEN 3
    WHEN 'pm' THEN 4
    WHEN 'zaczep' THEN 5
    WHEN 'post' THEN 6
    WHEN 'watek' THEN 7
  END,
  IF(akcja = 'profil' OR akcja = 'pm' OR akcja = 'zaczep', akcja_id, NULL),
  IF(akcja = 'post', post_id, NULL),
  IF(akcja = 'watek', akcja_id, NULL)
FROM OLD_DB.akcje
ORDER BY kolumna_data ASC

Pure SQL migration may be a little bit more complicated (processing data in PHP is easier) - but you'll surely save a lot of time.

used for huge amounts of data, as this can be the only possibility to migrate it
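
If you prefer to run such a script from PHP rather than from a database console, a minimal sketch could look like the one below. It is an assumption-heavy illustration: an in-memory SQLite database stands in for the two MySQL databases, the tables are cut down to three columns, and CASE replaces MySQL's IF(). The point is only that a single set-based INSERT ... SELECT does the whole copy, followed by a row count check.

```php
<?php
// Assumption: SQLite in memory replaces OLD_DB and NEW_DB to keep this runnable.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Simplified stand-ins for the old and the new table.
$db->exec('CREATE TABLE wiadomosci (wiad_id INTEGER PRIMARY KEY, temat TEXT, zaczep TEXT)');
$db->exec('CREATE TABLE message (id INTEGER PRIMARY KEY, title TEXT, provoke INTEGER)');
$db->exec("INSERT INTO wiadomosci VALUES (1, 'hello', 't'), (2, 'bye', 'f')");

// One statement copies everything; no row ever travels through PHP.
$db->beginTransaction();
$copied = $db->exec(
    "INSERT INTO message (id, title, provoke)
     SELECT wiad_id, temat, CASE WHEN zaczep = 't' THEN 1 ELSE 0 END
     FROM wiadomosci
     ORDER BY wiad_id"
);
$db->commit();

// Sanity check: both tables should now hold the same number of rows.
$oldCount = (int) $db->query('SELECT COUNT(*) FROM wiadomosci')->fetchColumn();
$newCount = (int) $db->query('SELECT COUNT(*) FROM message')->fetchColumn();
printf("copied %d rows, old=%d, new=%d\n", $copied, $oldCount, $newCount);
```

Comparing row counts after each step is a cheap check that fits well into the list of migration steps.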

mixed migration: Doctrine IN / SQL OUT

Sometimes it is impossible to create a magnificent "INSERT INTO SELECT FROM" query. PHP, although as slow as a dead snail (in comparison to SQL database management systems), still gives more possibilities simply because it is an imperative programming language. The idea of this migration type is to fetch the old system data using Doctrine, generate the SQL code and execute it. You may execute the generated script in any way you like - the db console, tools like phpmyadmin or even forcing Doctrine to execute raw SQL. If you prefer, this can also be done as a symfony task - but if the project is really big, you don't gain much from wrapping such a mechanism into a symfony task.

To fully understand the idea of this data migration type, take a look at the example below:

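static public function getInexistentProfileIds()
  {
    // fetch the ids of all users that still exist in the old database
    $u_ids = Doctrine_Query::create()
      ->from('Uzytkownicy u')
      ->select('u.u_id')
      ->orderBy('u.u_id')
      ->fetchArray();
 
    if (count($u_ids) == 0)
      return array();
 
    // all ids from 1 up to the highest existing id...
    $ids = range(1, $u_ids[count($u_ids)-1]['u_id']);
 
    // ...minus the ids that exist = ids of the deleted users
    foreach ($u_ids as $u_id)
      unset($ids[$u_id['u_id'] - 1]);
 
    return $ids;
  }

static public function generateSqlToRebuildUsers()
  {
    $ids = self::getInexistentProfileIds();
    if (count($ids) == 0)
      return; // nothing to rebuild
 
    $sql = 'INSERT INTO profile (`id`) VALUES ('.implode('), (', $ids).')';
    echo $sql;
    echo '<hr />';
 
    // build one "(...)" tuple per id, so that no dangling ", (" is left at the end
    $rows = array();
    foreach ($ids as $id)
      $rows[] = '('.$id.', "NULL-'.$id.'", "NULL-'.$id.'", 0)';
    $sql = 'INSERT INTO sf_guard_user (`id`, `email_address`, `username`, `is_active`) VALUES '.implode(', ', $rows);
    echo $sql;
  }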

The reason for doing a mixed migration in this case is data inconsistency in the old project. Users were creating forum topics and posts. And when a user was deleted, the posts and topics still referred to that user. Migrating forum data returned dozens of errors, all of them failed constraints - assigning a topic/post to a non-existent user. The solution was to create dead user records (with deleted_at set), just so that the SQL would not break the constraint. The first method fetches the IDs of all users that have been deleted, the second one generates the SQL code that inserts those dead records into the database. This is only an example, of course, and it may be found controversial :), but, believe me, migrating an entire system (which was really, really badly developed) is a difficult task in itself, the time you have is limited and sometimes you are forced to use somewhat brutal solutions. Anyway, it's up to you how you want to grab the generated SQL code. For me the easiest way was to display it in the browser, copy-paste it to phpmyadmin and execute it. But you can generate an SQL file, wrap it into a symfony task or do whatever you can think of. The main idea is just to make the data migration faster.
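
The same two steps (find the gaps in the id sequence, then generate one multi-row INSERT) can also be tried out without Doctrine. The sketch below is a standalone variant, not the code used in the project: the function names findMissingIds() and buildPlaceholderUsersSql() are made up for this illustration, and the list of existing ids is hard-coded instead of being fetched from the old database.

```php
<?php
// Hypothetical helper: returns every id between 1 and the highest existing id
// that does not appear in $existingIds.
function findMissingIds(array $existingIds): array
{
    if ($existingIds === []) {
        return [];
    }
    // array_diff keeps the keys of range(), so reindex the result.
    return array_values(array_diff(range(1, max($existingIds)), $existingIds));
}

// Hypothetical helper: builds one multi-row INSERT creating placeholder users.
function buildPlaceholderUsersSql(array $ids): string
{
    if ($ids === []) {
        return ''; // "VALUES" with no tuples would be invalid SQL
    }
    $rows = [];
    foreach ($ids as $id) {
        $id = (int) $id; // only integers ever reach the SQL string
        $rows[] = sprintf("(%d, 'NULL-%d', 'NULL-%d', 0)", $id, $id, $id);
    }
    return 'INSERT INTO sf_guard_user (`id`, `email_address`, `username`, `is_active`) VALUES '
        . implode(', ', $rows) . ';';
}

$missing = findMissingIds([1, 2, 4, 7]); // users 3, 5 and 6 were deleted
echo buildPlaceholderUsersSql($missing), "\n";
```

Collecting the tuples in an array and joining them with implode() avoids having to trim a trailing separator, and the empty-input guard prevents generating an INSERT without values.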

and that's it

I hope that some of my hints will be useful if you ever need to migrate data between different versions of the same project. Comments are welcome :)

faster queries: indexing tables

When designing complex web applications, you have to pay attention to performance so that each request is handled as fast as possible. This involves optimising the client side (clean CSS, clean HTML, fast JavaScript, etc.) and the server side (caching templates and queries, database usage and many others). We will concentrate on the database here. In short, the database should be structured so that all information needed to handle any request can be fetched very quickly. This short article will show you a few facts and tricks about the performance of symfony projects.

built-in foreign key management

One of the brilliant features of symfony (with Doctrine) is that indexes for foreign keys are created by default. This saves developers a lot of time and surely improves the overall performance of symfony applications. Below is an example schema:

Book:
  actAs:
    Timestampable: ~
    SoftDelete: ~
  columns:
    category_id:
      type: integer
      notnull: true
      comment: "kategoria książek"
    title:
      type: string(255)
      notnull: true
      comment: "tytuł"
    author:
      type: string(255)
      comment: "autor"
    description:
      type: string
      comment: "opis"
  relations:
    Category:
      class: BookCategory
      local: category_id
      foreign: id
      foreignAlias: Books

BookCategory:
  actAs:
    Timestampable: ~
    SoftDelete: ~
  columns:
    name:
      type: string(255)
      notnull: true
      comment: "nazwa"

Such a schema will generate the following SQL code. Note that the book.category_id column is indexed (faster queries) and constrained (no data inconsistency) at the same time, automatically:

CREATE TABLE book_category (id BIGINT AUTO_INCREMENT ...;
CREATE TABLE book (id BIGINT AUTO_INCREMENT ...
  INDEX category_id_idx (category_id) ...;
ALTER TABLE book ADD CONSTRAINT book_category_id_book_category_id
  FOREIGN KEY (category_id) REFERENCES book_category(id);

Obviously, you may create your custom indexes (and this will be discussed later). This section from The symfony and Doctrine book covers the topic of custom indexes.

optimising MySQL queries

Before you do anything to speed up query execution, you need to know which queries your system runs. The obvious place to look is the powerful web debug toolbar. It's a great tool, but it won't tell you what to do when your queries take too long. It can, however, point out which queries are definitely poorly written (they are highlighted in orange). Then it's up to you to solve the problem. Often it turns out that you need to join tables (more about this is written in the "less doctrine queries" article).

If the number of queries cannot be reduced, you may need to add custom indexes to your tables. An index is an additional structure, bound to a table, that speeds up selecting the matching rows (there are lots of good tutorials on this topic, such as the tizag tutorials). When executing a query, the database server looks for the best structure it can use to serve the results as fast as possible. We can ask the database server to analyse any given query and tell us how it is going to be executed. The best tool for this is the EXPLAIN statement (short reference here). We will optimise a heavy query executed on the homepage of a social website, using EXPLAIN and adding a custom index.

example - social website homepage problem

The manager of the social website wants the developers to highlight the most active users. For example, he wants to display the most recently logged-in users on the homepage. The developers figured out that they need to create an action table that will store actions performed by users. The action and profile tables are related to each other - a simple JOIN is used each time the homepage action is executed: the last x logged-in profiles are fetched from the database and then displayed.

The website has been launched. Many users have registered and the action table is growing bigger and bigger every day. After a few months, it has over 300'000 records. The manager is very happy that his project is becoming popular, but he has noticed that the homepage is served a few seconds slower than it was in the beginning. The developers tell him that they didn't run any high-load performance tests and that they have to spend some time on optimisation. The manager is not pleased that this was not considered before.

NOTE: always use test data when focusing on project performance

Symfony has a built-in fixture mechanism which allows you to easily generate lots of different data (see the jobeet tutorial). This is essential when you want to make sure that your project will cope with high popularity. However, if you decide to generate a really big amount of data, do NOT use any ORM. It consumes too much memory and generating fixtures takes a lot of your precious time. I'd suggest generating raw SQL INSERT queries instead - they'll be a lot faster.
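
Generating such raw INSERT queries can be sketched in a few lines of plain PHP. The function generateActionInserts() below is a hypothetical helper written for this illustration; it assumes an action table with the created_by and type_id_external columns used in the query analysed later in this article, and it groups rows into multi-row INSERT statements so the database has fewer statements to parse.

```php
<?php
// Hypothetical generator of test data for an "action" table.
// Returns a list of multi-row INSERT statements, $batchSize rows each.
function generateActionInserts(int $rows, int $batchSize = 1000, int $userCount = 100): array
{
    $statements = [];
    $values = [];
    for ($i = 1; $i <= $rows; $i++) {
        $userId = ($i % $userCount) + 1; // deterministic, so every run gives the same data
        $typeId = ($i % 7) + 1;          // assumption: 7 action types, numbered 1..7
        $values[] = sprintf('(%d, %d)', $userId, $typeId);

        // flush a full batch, or whatever is left after the last row
        if (count($values) === $batchSize || $i === $rows) {
            $statements[] = 'INSERT INTO action (created_by, type_id_external) VALUES '
                . implode(', ', $values) . ';';
            $values = [];
        }
    }
    return $statements;
}

// 5 rows in batches of 2 give 3 statements.
$statements = generateActionInserts(5, 2);
echo implode("\n", $statements), "\n";
```

For a realistic test you would raise the row count to a few hundred thousand, write the statements to a file with file_put_contents() and load that file with the database console.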

Okay, let's move on. Once you have lots of data (either real or test), execute each action - one after another - and check its performance. The first thing to look at is the web debug toolbar mentioned above, in the top right corner of the screen when running the dev application environment. You should be worried when you see something like the following:

There is a big problem: at least one of the queries is suboptimal (orange color) and, as a consequence, executing this action takes too much time (almost 5 seconds per execution is really long, and it doesn't matter that I'm using my personal computer for testing). Left click on the query icon (the one to the right):

One query takes almost 4 seconds to be executed. This surely causes a serious performance problem! Don't panic, just let your database server analyse the query for you:

EXPLAIN
  SELECT a.id AS a__id, a.created_by AS a__created_by
  FROM action a
  LEFT JOIN profile p ON
    (a.created_by = p.id AND p.deleted_at IS NULL )
  WHERE (
    a.type_id_external = '2'
    AND p.avatar_id IS NOT NULL
    AND p.mode =4
  )
  ORDER BY a.created_at DESC;

Here we can see that the query has to check at least 1690 rows of the p (profile) table. And each profile record stores a lot of text data describing a website user. All this makes the query take so long to execute. If we want to speed it up, we have to read the query carefully and concentrate on all the columns used (and the order in which they appear). The solution is to find the best index (this topic may be quite complex and is independent of the framework you use - ask google about indexing database tables/optimising databases - and read those articles carefully).

In this case, the developers executed the following line in MySQL:

ALTER TABLE profile ADD INDEX deleted_at_idx (deleted_at);

which created an index on the deleted_at datetime column. Thanks to this index, the EXPLAIN command shows that only 10 rows in the profile table have to be analysed for the query to be executed. And this is a great success - the execution time went down to 0.01 second. Imagine a difference of 4 seconds on each homepage display. This is the benefit of optimising project databases.
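
You can watch the query plan change before and after adding an index with a small experiment. The sketch below is not the MySQL session from this article: it assumes an in-memory SQLite database, uses SQLite's EXPLAIN QUERY PLAN instead of MySQL's EXPLAIN, and works on a cut-down profile table. The helper queryPlan() is hypothetical, and the exact plan wording differs between SQLite versions.

```php
<?php
// Hypothetical helper: returns the planner's description of how $sql would run.
function queryPlan(PDO $db, string $sql): string
{
    $rows = $db->query('EXPLAIN QUERY PLAN ' . $sql)->fetchAll(PDO::FETCH_ASSOC);
    return implode(' | ', array_column($rows, 'detail'));
}

// Assumption: SQLite in memory stands in for the MySQL server.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE profile (id INTEGER PRIMARY KEY, deleted_at TEXT, bio TEXT)');

$sql = 'SELECT id FROM profile WHERE deleted_at IS NULL';

// Without an index the planner can only scan the whole table.
$before = queryPlan($db, $sql);

// Same index name as in the ALTER TABLE statement above.
$db->exec('CREATE INDEX deleted_at_idx ON profile (deleted_at)');
$after = queryPlan($db, $sql);

echo "before: $before\n";
echo "after:  $after\n";
```

The first plan describes a scan of the whole table, the second one mentions deleted_at_idx. Whether a real server picks the index depends on its version and on the data, so always check the plan on the target system.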

By the way, I cannot understand why the deleted_at column in the SoftDelete behavior is not indexed by default, especially when you can turn on the automatic null check (DQL callbacks):

$manager->setAttribute(Doctrine_Core::ATTR_USE_DQL_CALLBACKS, true);

Provided this line is present in the project configuration, each query that fetches a SoftDelete model will automatically get the "deleted_at IS NULL" condition (you can see it in the JOIN of the query analysed above). It's obvious that such a column should be indexed - the index can be a multi-column one, with deleted_at as its last column - but, anyway, a default index on deleted_at is a good idea! As you can see, you have to pay attention to what queries are executed inside your projects!

Note: different database server versions use different indexes

Different database server versions may use totally different indexes to execute the same queries on the same database structure! Make sure you run your performance tests in the (future) production environment. Otherwise, you may find your application executing unoptimised queries on the production server even if you spent a lot of time optimising it on your local dev machine.

In the example above, it turned out that the production server ran a different db server version than the developer's local machine. The developer didn't check it - he was not aware of the differences and their negative impact on the project's performance. The index that had been built was useless in the production environment (so it should be deleted, because it slows down every insert). Moreover, it turned out that the index the developers actually needed to speed up the query had to be built on the action table... pay attention to the database server versions you work on!

how many indexes to create

Table indexes are really helpful and they speed up database reads. The more complex your application is, the more queries it executes. In other words, the bigger your application is, the more indexes it may need to provide good performance. But look out - do not create too many indexes and never create an index when you are not sure that it is used somewhere. Why? It's very simple - each index is an additional structure which uses some space and needs some time to be created and maintained. When a record is inserted, updated or removed, each index has to be updated according to the change made, which surely consumes time. If you create too many indexes, you may simply slow down your database. For example, each user login inserts a new action record, so an action table with 10 indexes will be slower on inserts than an action table with only 3 indexes.

Symfony world