Blog of Adrian Anttila, containing my thoughts, comments and questions. RSS Feed


Excel and CSV Reference

I've been doing a lot of work with CSV and Excel imports recently. There are a lot of poorly documented or disparate pieces of information that are required to interface with a CSV or Excel data source, so I'm recording it here.

Formatting Overview

There are a few formatting rules that you need to consider when working with CSV files.

  • Double-quotes are escaped with two double-quotes: " becomes ""
  • Fields with reserved characters are escaped with double-quotes; this includes carriage-return and new-line characters

Using a StreamReader is probably a no-no

For simple files, using a StreamReader will work fine. However, if you have any fields that contain carriage-returns or line-feeds, ReadLine() won't get you the full set of data for a single record. I suppose that you could try and read subsequent lines until you read the expected number of fields, but that brings up the next problem: commas. Fields can contain commas, so calling Split() with a comma as the parameter can produce unexpected results.

As an aside, a StreamReader won't work for native Excel files, so you would be stuck with a hybrid approach. So what do you do? Use OLE DB.

System.Data.OleDb to the rescue

There are OLE DB providers for CSV, Excel 97-2003, and Excel 2007. The CSV and Excel 97-2003 drivers seem to be available with either Windows, the Microsoft Data Access Components, or the .NET Framework. The Excel 2007 drivers are part of the 2007 Office System Driver: Data Connectivity Components, a separate download.

Each file format requires a slightly different connection string, as shown below.

CSV

Provider=Microsoft.Jet.OLEDB.4.0;
Data Source={0};
Extended Properties="text;HDR=Yes;FMT=Delimited;";

Excel 97-2003

Provider=Microsoft.Jet.OLEDB.4.0;
Data Source={0};
Extended Properties="Excel 8.0;HDR=Yes;IMEX=1;";

Excel 2007

Provider=Microsoft.ACE.OLEDB.12.0;
Data Source={0};
Extended Properties="Excel 12.0;HDR=Yes;IMEX=1;";

Note: The data source for CSV is the directory where the file can be found, not the full path to the file.

HDR=Yes is required if the first line/row in the file has column names, aka, header row. I recommend that you use header rows whenever possible so you're not depending on the column ordering.

IMEX=1 indicates that columns may have mixed data. This is important in situations where a column may contain both text and numerical data. Without IMEX=1, mixed columns default to decimal, and string values are read as null values. By default, the OLE DB provider only reads the first 8 rows to determine a column's data type when IMEX=1 was specified. To increase the number of rows examined, change the TypeGuessRows value in the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel sections in your registry. Changing the value to 0 will force all columns to be read before the column type will be determined.

The easiest way to determine which connection string to use is to initialize a new System.IO.FileInfo instance with the name of the file you're working with, and use its Extension property.

Querying for data

With CSV, you query against files, not tables. With Excel (any version), you query against sheets, not tables. However, either format works with SQL queries.

The easiest way to translate the file or sheet name into a table name is to use call OleDbConnection.GetOleDbSchemaTable. GetOleDbSchemaTable returns a DataTable with schema information for the OLEDB data source.

DataTable schemaDataTable = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
foreach (DataRow dataRow in schemaDataTable.Rows)
{
    string tableName = (string) dataRow["TABLE_NAME"];
    // Do something important here
}

The other important thing to consider is that the names of columns, files, and sheets can contain spaces. When you build your queries, it's a good idea to escape everything with brackets.

    string commandText = String.Format("SELECT [First Name], [Last Name], [E-Mail Address] FROM [{0}]", tableName);

Example

If a header row is present, you can call GetOrdinal to determine the index of a column by its name. Here is a helper method, GetDataReaderString, to read the column value as a string.

private string GetDataReaderString(IDataReader dataReader, int ordinal)
{
    if (dataReader.IsDBNull(ordinal))
    {
        return null;
    }
    else
    {
        return dataReader.GetValue(ordinal).ToString();
    }
}

Here's an example that reads a file with a header row.

using (OleDbConnection connection = new OleDbConnection(connectionString))
{
    connection.Open();
    
    DataTable schemaDataTable = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
    foreach (DataRow dataRow in schemaDataTable.Rows)
    {
        string tableName = (string) dataRow["TABLE_NAME"];

        if ((extension == ".csv" && tableName.Contains("YourFileNameWithoutExtension")) || extension == ".xls" || extension == ".xlsx")
        {
            // Build the command text
            string commandText = String.Format("SELECT [First Name], [Last Name], [E-Mail Address] FROM [{0}]", tableName);

            // Issue the command and process the results
            using (OleDbCommand command = new OleDbCommand(commandText, connection))
            {
                using (OleDbDataReader dataReader = command.ExecuteReader())
                {
                    int firstNameOrdinal = dataReader.GetOrdinal("First Name");
                    int lastNameOrdinal = dataReader.GetOrdinal("Last Name");
                    int emailOrdinal = dataReader.GetOrdinal("E-Mail Address");

                    while (dataReader.Read())
                    {
                        string firstName = GetDataReaderString(dataReader, firstNameOrdinal);
                        string lastName = GetDataReaderString(dataReader, lastNameOrdinal);
                        string email = GetDataReaderString(dataReader, emailOrdinal);

                        // Do something important here
                    }
                }
            }

            // Only process the first sheet of an Excel 97-2003 or Excel 2007 file; or, only process the CSV that was uploaded by the member
            break;
        }
    }
}

Hope this helps!

 
Posted by Adrian Anttila | 5 Comments | Trackback Url | Bookmark with:        
Tags:

Links to this Post

Comments

Thursday, 17 Apr 2008 04:06 by csv to Linq
have you seen this? http://www.codeproject.com/KB/linq/LINQtoCSV.aspx

Saturday, 3 May 2008 02:57 by Thanks!
I was running into issues w/ .CSV vs. .XLS. Your page does a good job of explaining the difference!

Wednesday, 7 May 2008 03:36 by RE: csv to Linq
I did look into using it, but that would have forced me to support 2 different code paths: one for CSV and one for Excel. In the end, taking the OleDb allowed for a simpler codebase.

Thursday, 21 Aug 2008 10:45 by Charles Rex
Hello, Is there a IMEX=1 version for CSV drivers, without going to the registry ? See this too http://www.dailydoseofexcel.com/archives/2004/06/03/external-data-mixed-data-types/ In the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Text I must set MaxScanRows = 0 instead of MaxScanRows = 25. The problem is that I don't want to do that in the registry, I want the details to be set in my application ! I have both strings and numeric data in one column, and I get unexpected results, some of data doesn’t show up. e.g. If the column in the CSV file is like this: aaa bbb 111 222 333 444 555 666 777 etc is seen after SELECT as empty value(null) empty value(null) 111 222 333 444 555 666 777 etc My connection string looks exactly like yours: (I'm using C# as you do) Provider=Microsoft.Jet.OLEDB.4.0; Data Source={0}; Extended Properties="text;HDR=Yes;FMT=Delimited;"; Thank you

Thursday, 21 Aug 2008 11:17 by Charles Rex
After reading http://msdn.microsoft.com/en-us/library/ms974559.aspx I decided to create a schema.ini file in the same folder with the CSV file The schema.ini should mention the CSV file name enclosed in square brackets e.g. [MyFile.csv] MaxScanRows=0 If I do this, things are ok. Now have both strings and numeric data in the column.

Name:
URL:
Email:
Comments:

CAPTCHA Image Validation