Drupal 11: Using The Batch API To Process CSV Files

This is the fourth article in a series of articles about the Batch API in Drupal. The Batch API is a system in Drupal that allows data to be processed in small chunks in order to prevent timeout errors or memory problems.

So far in this series we have looked at creating a batch process using a form, followed by creating a batch class so that batches can be run through Drush and then using the finished state to control batch processing. All of these articles go together to form a basis of batch processing in Drupal.

In this article we will look at bringing these concepts together to perform a task that is quite common on websites, processing Comma Separated Value or CSV files, which we will do using the Drupal Batch API.

Processing CSV data is very common on the web. Whilst it is quite common to integrate with an API it is sometimes simpler to just generate a CSV file of the data you need from one system and upload it into a form on a website. Most systems will allow you to export data as a CSV file of some kind, but it's also very easy to export a CSV from commonly used programs like Excel and Google Sheets.

Processing a CSV file in PHP is quite simple, but once you reach one hundred records you will find that PHP will start to throw errors due to timeouts or memory issues. The solution is to employ the Batch API to spread the load of that processing over a number of different requests.

I have seen a few examples of batch processing CSV files and they often get things wrong by processing the entire CSV file into an array at the start of the batch process. In the examples seen here we will process the CSV using the file pointer functions built into PHP and use the finished property to inform Drupal that the file has finished processing.

The CSV File

This is the CSV file that we will use in the examples below. This CSV file contains a random text for a title in column 1, and a random sentence in column 2.

jcDBBwa4tjlCOWv,"Ille ludus paulatim quia saepius sit tincidunt."
rFm8ZPDkeyV39rt,"Conventio decet molior nisl ratis sagaciter sudo suscipit typicus venio."
v5At5x5fYL38dh8,"Magna melior odio quae."
g3Wx7mESABwh5ap,"Caecus consequat cui esse gemino gravis hendrerit nostrud probo quibus."
zRz0gHbRsTl3EXv,"Haero refero te."
rlCXRe5zoxBSZLM,"Erat interdico molior qui."

We will be generating pages of the Article content type, which we get with the standard install profile in Drupal. Column 1 will be used for the title of the Article and column 2 will be used for the body copy of the article.

The File Upload Form

In order to allow our users to upload a CSV file we need to create a file upload form that will then trigger a batch run.

Here is the form definition class and the buildForm() method. The form itself consists of a "file" field where we can upload our CSV file, and a "submit" field where we can submit the form.

<?php

namespace Drupal\batch_csv_example\Form;

use Drupal\batch_csv_example\Batch\BatchClass;
use Drupal\Core\Batch\BatchBuilder;
use Drupal\Core\Form\FormBase;
use Drupal\Core\Form\FormStateInterface;
use Drupal\Core\Url;
use Drupal\file\Entity\File;

class BatchForm extends FormBase {

  public function getFormId() {
    return 'batch_csv_example';
  }
  
  public function buildForm(array $form, FormStateInterface $form_state) {
    $form['csv_file'] = [
      '#type' => 'file',
      '#title' => $this->t('The CSV file to process'),
    ];

    $form['actions'] = [
      '#type' => 'actions',
      'submit' => [
        '#type' => 'submit',
        '#value' => $this->t('Run batch'),
      ],
    ];

    return $form;
  }
}

When the form is submitted we first trigger the validateForm() method, which is where we can copy the file from the request into temporary storage. The copying of the file data from the request is done using the file_save_upload() function, which is part of the Drupal file system. We include the FileExtension validator as the second parameter to ensure that the file uploaded is a CSV file.

The result of file_save_upload() is a entity of type \Drupal\file\Entity\File (or false if the file failed validation or failed to copy). If everything worked then we set a the entity as a value in the form state called "temp_csv_file", which we can pick up and process in the submit handler.

  public function validateForm(array &$form, FormStateInterface $form_state): void {
    parent::validateForm($form, $form_state);

    // Create a temporary file entity based on the file form element.
    // Note file_save_upload() might soon be deprecated.
    // https://www.drupal.org/project/drupal/issues/3375423
    $tempFileEntity = file_save_upload(
      'csv_file',
      [
        'FileExtension' => ['extensions' => 'csv'],
      ],
      FALSE,
      0,
    );

    if (!$tempFileEntity instanceof File) {
      $form_state->setErrorByName('csv_file', $this->t('Upload failed.'));
      return;
    }

    // Keep the temporary file entity available for submit handler.
    $form_state->setValue('temp_csv_file', $tempFileEntity);
  }

Note that the file_save_upload() function will be removed at some point in the not too distant future, probably being deprecated in Drupal 11 and removed in Drupal 12. As an alternative service hasn't been developed yet we need to use file_save_upload() here. I've added a link to the issue regarding the replacement of this function in case you want to follow that progress.

We are saving the CSV as a temporary file here so that it will be cleaned out automatically at some point in the future. The location of the temporary file directory and the frequency of cleaning temporary files are configurable within Drupal. If you are on a load balanced system you will want to ensure that the temporary directory is available to all web nodes or you will get file missing errors as you attempt to process them in a batch. 

With that in place the submit handler will now have access to the temporary file we created in the validation handler.

Starting The Batch Operation

To process the CSV file we need to trigger a batch process. The setup of the batch run here is explained in an introduction to batch processing with the Batch API, so please check that article out if you want more information.

All we need to do here is pass the path of the file to the batch processing method as we will figure out the rest in the processing method itself. We are only creating a single operation that we will call over and over again until the file has been completely processed.

  public function submitForm(array &$form, FormStateInterface $form_state): void {
    $tempFile = $form_state->getValue('temp_csv_file');

    $batch = new BatchBuilder();
    $batch->setTitle('Running batch process.')
      ->setFinishCallback([BatchClass::class, 'batchFinished'])
      ->setInitMessage('Commencing')
      ->setProgressMessage('Processing...')
      ->setErrorMessage('An error occurred during processing.');

    // Process the CSV as a batch.
    $batch->addOperation([BatchClass::class, 'batchProcess'], [$tempFile->getFileUri()]);

    batch_set($batch->toArray());

    $form_state->setRedirectUrl(new Url($this->getFormId()));
  }

Note that we can't pass the $tempFile entity here as the batch operations use the queue system and that doesn't allow for complex objects to be stored as queue properties. We therefore pass the filename instead, which we can use for everything we need.

Processing The CSV File

The batchProcess() method is the single operation we setup in the form submission handle and this is where we process the CSV file to create the Articles in our Drupal site.

The first thing we need to do is set up our sandbox and results array. The results are the normal updated/skipped/failed etc items that we set up in the other articles in this series. The sandbox is where we set up properties to track the progress of the file processing. The "seek" property is used to figure out how far along the file we are. What we will be doing here is using the fseek() PHP function to find the current place in the CSV file, so we need to keep track of the seek parameter in our sandbox.

public static function batchProcess(string $fileName, array &$context): void {
    if (!isset($context['sandbox']['progress'])) {
      $context['sandbox']['progress'] = 0;
      $context['sandbox']['seek'] = 0;
    }
    if (!isset($context['results']['updated'])) {
      $context['results']['updated'] = 0;
      $context['results']['skipped'] = 0;
      $context['results']['failed'] = 0;
      $context['results']['progress'] = 0;
      $context['results']['process'] = 'CSV batch completed';
    }
    
    // - Continue batch processing...
}

The first thing we do in our batch process is to get the size of the file using the filesize() PHP function. This returns the filesize in bytes, which will be important as we process the file. The message we send back to the user is reporting the filesize in kilobytes for readability, so we need to do a quick conversion as we populate the message arguments.

$filesize = filesize($fileName);

// Message above progress bar.
$percent = round(($context['sandbox']['seek'] / $filesize) * 100);
$context['message'] = t('Processing file, @seek of @filesize complete (@percentage%).', [
  '@seek' => number_format($context['sandbox']['seek'] / 1024) . 'kb',
  '@filesize' => number_format($filesize / 1024) . 'kb',
  '@percentage' => $percent,
]);

We aren't going to be processing the entire CSV file in one go, so we need to add a limit for the number of items that we want to process in a single batch. A second variable called $count is used to count how many items we have processed. If the count reaches the limit then we stop the CSV processing and pick it up in the next batch run.

// How many lines to process at once?
$limit = 50;

// Keep track of how many lines we have processed in this batch.
$count = 0;

These two variables could be reduced down to a single item (i.e. the limit), but I think it's clearer what is going on here. Ultimately, this is down to coding style and how you want to count down the number of items processed.

The next part is the important part of the batch process. The seek property in our sandbox tells us where the file processing is currently at, so we use the fseek() PHP function to move the file pointer to the correct point in the CSV file. This means that when we read data out of the file it will be at the correct place and will extract data we haven't already read.

The first time we call the this batch function the seek property will be 0, but as we finish the batch process method the seek property is updated. This means that the next time the batch process is called we will advance to a different place in the CSV file.

fseek($handle, $context['sandbox']['seek']);
while ($line = fgetcsv($handle, 4096)) {
  // Validate the CSV.
  if (count($line) !== 2) {
    // The line in the CSV file won't import correctly. So skip this line.
    $context['results']['skipped']++;
    continue;
  }

  // Extract the data from the CSV file and create the Article here.

  $count++;
  if ($count >= $limit) {
    // We have reached the limit for this run, break out of the loop.
    // If we have more file to process then we will run the batch
    // function again.
    break;
  }
}

The fgetcsv() PHP function will extract a line out of the CSV file and store it as an array in the $line variable. As we are expecting the $line array to have 2 items we can perform some simple validation to prevent any errors being thrown as we try to use invalid lines.

At this point, the Article creation code can be called, which just feeds the CSV data (in the $line variable) into the Node::create() method. This creates a Node object, which we can save to the database using the save() method.

// Process the CSV item.
$node = Node::create([
  'type' => 'article',
  'title' => $line[0],
  'body' => [
    'value' => '<p>' . $line[1] . '</p>',
    'format' => filter_default_format(),
  ],
  'uid' => 1,
  'status' => 1,
]);
$node->save();

Once we have finished this batch process we need to find out where our file pointer is. We do this using the ftell() PHP function, which we then store in the seek property in our sandbox. This means that when the next batch process begins it will start at this point in the file, picking up where it left off.

// Update the position of the pointer.
$context['sandbox']['seek'] = ftell($handle);

Here is the complete file processing code of the batch process method.

if ($handle = fopen($fileName, 'r')) {
  fseek($handle, $context['sandbox']['seek']);
  while ($line = fgetcsv($handle, 4096)) {

    $context['results']['progress']++;

    // Validate the CSV.
    if (count($line) !== 2) {
      // The line in the CSV file won't import correctly. So skip this line.
      $context['results']['skipped']++;
      continue;
    }

    // Process the CSV item.
    $node = Node::create([
      'type' => 'article',
      'title' => $line[0],
      'body' => [
        'value' => '<p>' . $line[1] . '</p>',
        'format' => filter_default_format(),
      ],
      'uid' => 1,
      'status' => 1,
    ]);
    $node->save();

    $context['results']['updated']++;

    $count++;
    if ($count >= $limit) {
      // We have reached the limit for this run, break out of the loop.
      // If we have more file to process then we will run the batch
      // function again.
      break;
    }
  }
  // Update the position of the pointer.
  $context['sandbox']['seek'] = ftell($handle);

  // Close the file handle.
  fclose($handle);
}

Finally, the last thing we do in our batch process method is to update the 'finished' property. The finished property informs Drupal of the current state of the batch run, if the value is less than 1 then the batch process method will be called again, otherwise the batch will be stopped and the finish callback will be triggered.

By using the seek property and the file size we can determine when we have reached the end of the file processing and tell Drupal to stop the batch processing by dividing one by the other.

// Update the finished parameter.
$context['finished'] = $context['sandbox']['seek'] / $filesize;

This concludes the batch process method. Using these lines of code we can process a CSV file of great length and not exceed the timeouts or memory limits of PHP.

Here is a screenshot of the CSV batch operation in action, showing 15kb of the total 59kb file having been processed.

The CSV batch process running, showing 15kb of the total 59kb file having been processed.

To break down what is going on here in detail, let's say we had a 29kb CSV file. The process would proceed as follows.

  • The first time the process method is run the seek property is set to 0. 
    • We then use fseek() to set the file pointer of the CSV file to 0 and read off the first 50 rows of data, creating 50 Article nodes. 
    • As we finish the first run the seek property is set to 3016 bytes and we finish the process.
    • We calculate the finish property by dividing 3016 bytes 100kb, which is much less than 1 and so the process call is triggered again.
  • The process method is triggered a second time, and we use the fseek() function to move the pointer to the value of the seek property, which is 3016 bytes.
    • We then continue to process the CSV file from that point onward until we hit the limit of 50 items.
    • We update the seek parameter to be 5770 bytes, which is still less than the file size, so the process method is called again.
  • The process method is called a third time, and again we use the value of seek in our sandbox property to move the file pointer using the fseek() function.
    • 50 more items are processed.
    • The seek property is then set to 8684 bytes.
  • We keep processing the CSV file until the last run.
  • The last time we call the process method the fgetcsv() method will return false, which will break out of the loop of processing the CSV file. When we get the file pointer location it will be 29,924, which is the same value as our file size. When we calculate the finished property it will equal 1 and the batch operation is completed.

Processing a 29kb CSV file into Articles will create 500 articles. 

Conclusion

I have run this CSV processing code with up to 60k items in the file and not seen a single issue in processing the data. That processing had a smaller batch size as there were a few validation and data handling steps to perform for each record, but the central ideas of using fseek() and ftell() to track the progress of the file pointer where key principles in this technique.

When researching how to process CSV files using the Batch API I found a few examples that would extract all of the data from the CSV file right at the start of the process and create a number of batch operations based on the number of items in the CSV. This meant that if the file was large then the initialisation of the batch process would actually time out and fail before the process had begun. By sending the filename to the batch process and then nibbling away at the file in small chunks we remove any setup problems and can still process large CSV files without any slowdown.

All the code for this article is available on GitHub, along with the source code for all of the other examples in this series. The batch_csv_example module also contains a controller action that can be used to download a CSV file to process in the batch form. This controller action is linked from the form itself so you can easily find it. Be careful about using this module on your own sites as it will create 1,000 Articles in just a few seconds and I haven't created a way to remove them!

In the next article in this series we will be looking at adding additional operations to already running batch processes. If you are looking for any particular questions to be answered about the Batch API then please get in touch.

More in this series

Add new comment

The content of this field is kept private and will not be shown publicly.