I've been doing a lot of work with CSV and Excel imports recently. There are a lot of poorly documented or disparate pieces of information that are required to interface with a CSV or Excel data source, so I'm recording it here.
Formatting Overview
There are a few formatting rules that you need to consider when working with CSV files.
- Double-quotes are escaped with two double-quotes: " becomes ""
- Fields with reserved characters are escaped with double-quotes; this includes carriage-return and new-line characters
Using a StreamReader is probably a no-no
For simple files, using a StreamReader will work fine. However, if you have any fields that contain carriage-returns or line-feeds, ReadLine() won't get you the full set of data for a single record. I suppose that you could try and read subsequent lines until you read the expected number of fields, but that brings up the next problem: commas. Fields can contain commas, so calling Split() with a comma as the parameter can produce unexpected results.
As an aside, a StreamReader won't work for native Excel files, so you would be stuck with a hybrid approach. So what do you do? Use OLE DB.
System.Data.OleDb to the rescue
There are OLE DB providers for CSV, Excel 97-2003, and Excel 2007. The CSV and Excel 97-2003 drivers seem to be available with either Windows, the Microsoft Data Access Components, or the .NET Framework. The Excel 2007 drivers are part of the 2007 Office System Driver: Data Connectivity Components, a separate download.
Each file format requires a slightly different connection string, as shown below.
|
CSV |
Provider=Microsoft.Jet.OLEDB.4.0; Data Source={0}; Extended Properties="text;HDR=Yes;FMT=Delimited;"; |
|
Excel 97-2003 |
Provider=Microsoft.Jet.OLEDB.4.0; Data Source={0}; Extended Properties="Excel 8.0;HDR=Yes;IMEX=1;"; |
|
Excel 2007 |
Provider=Microsoft.ACE.OLEDB.12.0; Data Source={0}; Extended Properties="Excel 12.0;HDR=Yes;IMEX=1;"; |
Note: The data source for CSV is the directory where the file can be found, not the full path to the file.
HDR=Yes is required if the first line/row in the file has column names, aka, header row. I recommend that you use header rows whenever possible so you're not depending on the column ordering.
IMEX=1 indicates that columns may have mixed data. This is important in situations where a column may contain both text and numerical data. Without IMEX=1, mixed columns default to decimal, and string values are read as null values. By default, the OLE DB provider only reads the first 8 rows to determine a column's data type when IMEX=1 was specified. To increase the number of rows examined, change the TypeGuessRows value in the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel sections in your registry. Changing the value to 0 will force all columns to be read before the column type will be determined.
The easiest way to determine which connection string to use is to initialize a new System.IO.FileInfo instance with the name of the file you're working with, and use its Extension property.
Querying for data
With CSV, you query against files, not tables. With Excel (any version), you query against sheets, not tables. However, either format works with SQL queries.
The easiest way to translate the file or sheet name into a table name is to use call OleDbConnection.GetOleDbSchemaTable. GetOleDbSchemaTable returns a DataTable with schema information for the OLEDB data source.
DataTable schemaDataTable = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
foreach (DataRow dataRow in schemaDataTable.Rows)
{
string tableName = (string) dataRow["TABLE_NAME"];
// Do something important here
}
The other important thing to consider is that the names of columns, files, and sheets can contain spaces. When you build your queries, it's a good idea to escape everything with brackets.
string commandText = String.Format("SELECT [First Name], [Last Name], [E-Mail Address] FROM [{0}]", tableName);
Example
If a header row is present, you can call GetOrdinal to determine the index of a column by its name. Here is a helper method, GetDataReaderString, to read the column value as a string.
private string GetDataReaderString(IDataReader dataReader, int ordinal)
{
if (dataReader.IsDBNull(ordinal))
{
return null;
}
else
{
return dataReader.GetValue(ordinal).ToString();
}
}
Here's an example that reads a file with a header row.
using (OleDbConnection connection = new OleDbConnection(connectionString))
{
connection.Open();
DataTable schemaDataTable = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
foreach (DataRow dataRow in schemaDataTable.Rows)
{
string tableName = (string) dataRow["TABLE_NAME"];
if ((extension == ".csv" && tableName.Contains("YourFileNameWithoutExtension")) || extension == ".xls" || extension == ".xlsx")
{
// Build the command text
string commandText = String.Format("SELECT [First Name], [Last Name], [E-Mail Address] FROM [{0}]", tableName);
// Issue the command and process the results
using (OleDbCommand command = new OleDbCommand(commandText, connection))
{
using (OleDbDataReader dataReader = command.ExecuteReader())
{
int firstNameOrdinal = dataReader.GetOrdinal("First Name");
int lastNameOrdinal = dataReader.GetOrdinal("Last Name");
int emailOrdinal = dataReader.GetOrdinal("E-Mail Address");
while (dataReader.Read())
{
string firstName = GetDataReaderString(dataReader, firstNameOrdinal);
string lastName = GetDataReaderString(dataReader, lastNameOrdinal);
string email = GetDataReaderString(dataReader, emailOrdinal);
// Do something important here
}
}
}
// Only process the first sheet of an Excel 97-2003 or Excel 2007 file; or, only process the CSV that was uploaded by the member
break;
}
}
}
Hope this helps!