Robust CSV file import with DictReader and chardet

Wondering how to import a CSV file in Python? I've got you covered! Read on and learn how to do it with DictReader and chardet.

As easy as it seems, Python is a language of great opportunities, and mastery of it comes with a lot of practice. It has a lot of insanely useful libraries, and csv (which provides the DictReader class) is definitely one of them. This will be an introductory post, so you don't have to worry about your knowledge of Python. If you're looking for tips on how to start learning Python, we've got something for you as well.

Import CSV file in Python: The absolute basics

You might ask: if it's that easy, then why should I even read this?

Well… it's easy, but it might also be a bit confusing because of the number of options available. Moreover, validating columns and detecting whether the file is a valid CSV file are not built-in functionalities of the csv library, so after the introduction I will describe those as well.

As I mentioned earlier, parsing the file is pretty simple:

[c]
import csv  # import the csv module

# open example.csv as csv_file and iterate over its rows
with open('example.csv', newline='') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=';')
    for row in reader:
        # print out values of Col 1 and Col 2
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

... and that's all!

Yup, it's that easy. Well, at least the basics. We open the file, read all lines, get columns, profit!

What you might want to do besides that is:

check whether that really is a CSV file or not
validate that only the columns you want are there
check the encoding
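
Before adding those checks, it can help to see exactly what DictReader hands back. The sketch below parses a small in-memory sample instead of example.csv (the sample text and the `parse_rows` helper name are made up for illustration) and shows that each row is a dictionary keyed by the header names:

```python
import csv
import io

# Made-up sample standing in for example.csv; the delimiter matches the examples in this post.
SAMPLE = "Col 1;Col 2\nfoo;bar\nspam;eggs\n"


def parse_rows(text, delimiter=';'):
    """Return every data row of the CSV text as a plain dict keyed by header name."""
    reader = csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = parse_rows(SAMPLE)
for row in rows:
    print(row['Col 1'] + ' ' + row['Col 2'])
```

Because no fieldnames are passed here, DictReader takes the keys from the first line of the input and does not return that line as data.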

Validation

csv has a thing called "Sniffer" that, given a portion of the file, tries to deduce its dialect. When it cannot (for example, because no delimiter can be determined), it raises csv.Error, and that's probably what you're looking for. Keep in mind that it is a heuristic, not a strict validator:

[c]
import csv

with open('example.csv', newline='') as csv_file:
    try:
        # take a portion of the file (1024 characters at most) and try to get the Dialect
        csv.Sniffer().sniff(csv_file.read(1024))
        csv_file.seek(0)  # rewind so the file can be parsed from the start
    except csv.Error:
        print('I did not expect the Spanish Inquisition!')
[/c]

If you want to check whether the columns provided are what you expect, it gets a tiny bit trickier. Firstly, provide the fieldnames parameter for the DictReader instance (they will be used as keys for the dictionary):

[c]
import csv

with open('example.csv', newline='') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

Running that example, you'll notice that the first row (the one containing the header) is printed as well; in general we don't want that to happen.

Let's change this:

[c]
import csv

with open('example.csv', newline='') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        if is_first_row:
            is_first_row = False
            continue  # skip the row
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

Now that we skipped the header, let's get back to it to check whether it has the column set we want.

Before we do it, I'll define a small helper function:

[c]
from itertools import chain

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))
[/c]

Don't worry, it's not as complicated as it seems to be.

We take each element of the list (item (...) for item in nested_list) and check whether it's a list; if it isn't, we make a one-element list out of it. Then we join all the lists into a single one, which gives us a nice flattened list (i.e. for [1,2,[3,4]] we'll get [1,2,3,4]).

Now we can improve our import yet again:

[c]
import csv
from itertools import chain

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

with open('example.csv', newline='') as csv_file:
    is_first_row = True
    valid_columns = ['Col 1', 'Col 2']
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=valid_columns)
    for row in reader:
        if is_first_row:
            # if there are columns we don't want, row.values() will contain
            # 'Col 1', 'Col 2', ['Col 3', 'Col 4', (...)]
            current_columns = flatten_list(row.values())
            # compare sets, because when comparing lists order is important as well
            if set(valid_columns) != set(current_columns):
                print('This is not the file I expected! I quit!')
                break
            is_first_row = False
            continue
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]
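
The reason row.values() can contain a nested list is worth seeing in isolation. When a line has more fields than fieldnames, DictReader collects the surplus under the key given by restkey (None by default); missing fields are filled with restval (also None by default). The sketch below uses made-up in-memory header lines and hypothetical `first_row` and `header_matches` helpers to show both cases:

```python
import csv
import io
from itertools import chain


def flatten_list(nested_list):
    """Flatten one level of nesting: [1, 2, [3, 4]] -> [1, 2, 3, 4]."""
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))


def first_row(text, fieldnames, delimiter=';'):
    """Parse only the first line of text, using the given fieldnames as dictionary keys."""
    reader = csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter, fieldnames=fieldnames)
    return next(reader)


def header_matches(text, valid_columns, delimiter=';'):
    """Hypothetical helper: True when the first line holds exactly the expected column names."""
    row = first_row(text, valid_columns, delimiter)
    return set(flatten_list(row.values())) == set(valid_columns)


columns = ['Col 1', 'Col 2']

# Two surplus columns end up as a list stored under the None key.
extra = first_row("Col 1;Col 2;Col 3;Col 4\n", columns)
print(extra)

print(header_matches("Col 1;Col 2\n", columns))              # exact header
print(header_matches("Col 1;Col 2;Col 3;Col 4\n", columns))  # surplus columns
print(header_matches("Col 1\n", columns))                    # missing column
```

Note that the set comparison accepts the expected columns in any order, while fieldnames are still applied by position, so a file with swapped columns would pass the check.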
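
Coming back to the Sniffer check from the start of this section: sniff() also accepts an optional delimiters argument that restricts which characters it may pick as the delimiter. A minimal sketch, assuming the files are expected to be semicolon-separated as in the examples above (the `looks_like_csv` helper name and the sample strings are made up):

```python
import csv


def looks_like_csv(sample, delimiters=';'):
    """Return True when csv.Sniffer can deduce a dialect from the sample text."""
    try:
        csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        return False
    return True


print(looks_like_csv("Col 1;Col 2\n1;2\n3;4\n"))
print(looks_like_csv("just a line of plain text\nand another one\n"))
```

Since the Sniffer only guesses from a sample, a True result here is a hint rather than a guarantee; the column check is still needed.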

Encoding detection

As you probably know already, text file content can be represented using different encodings, for example UTF-8, windows-1250, iso-8859-2 etc. In some cases we want to detect that encoding and decode strings so that we can parse them the way we want.

To do that, we'll use the chardet library (it isn't available by default, so you need to install it, for example with pip).

Doing it is (again) pretty easy: read the file content as bytes and pass it to chardet.detect():

[c]
import chardet

# csv_file must be opened in binary mode ('rb'), because chardet works on bytes
csv_file_raw = csv_file.read()
encoding = chardet.detect(csv_file_raw)['encoding']
if not encoding:
    print('No encoding found for the file! Is it valid?')
elif 'UTF' in encoding.upper():
    encoding = encoding.replace('-', '').lower()
[/c]

The last two lines (stripping '-' from UTF-* names and then making them lowercase) normalize the name so that it can be compared against 'utf8' when decoding. To do that, I'll define the last helper function:

[c]
def string_to_utf8(string, source_encoding):
    # string is a bytes object, e.g. the raw content read from the file
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")
[/c]

Here we assume the target encoding is utf8 (UTF-8), and if the string isn't in that format, we simply decode it and encode it again using UTF-8.

Now let's put together all the things we discussed here:

[c]
import csv
import io
from itertools import chain

import chardet

def string_to_utf8(string, source_encoding):
    # string is a bytes object; return it re-encoded as UTF-8
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")

def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))

# read raw bytes, because chardet and string_to_utf8 both work on bytes
with open('example.csv', 'rb') as csv_file:
    csv_file_content = csv_file.read()

encoding = chardet.detect(csv_file_content)['encoding']
if not encoding:
    # without an encoding we cannot decode the content, so stop here
    raise SystemExit('No encoding found for the file! Is it valid?')
if 'UTF' in encoding.upper():
    encoding = encoding.replace('-', '').lower()

# convert the whole content to UTF-8 once, then hand text to the csv module
csv_text = string_to_utf8(csv_file_content, encoding).decode('utf8')

is_first_row = True
valid_columns = ['Col 1', 'Col 2']
reader = csv.DictReader(io.StringIO(csv_text, newline=''), delimiter=';', fieldnames=valid_columns)
for row in reader:
    if is_first_row:
        current_columns = flatten_list(row.values())
        if set(valid_columns) != set(current_columns):
            print('This is not the file I expected! I quit!')
            break
        is_first_row = False
        continue
    print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

As you can see, the content is converted to UTF-8 once, before it reaches the csv module. The column names we validate and the values we print are therefore already ordinary strings, which rules out the possibility of our code crashing on the string compare because of mismatched encodings.

And that's really all there is to importing a CSV file in Python! Isn't that awesome?
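
The detected name can also be normalized with the standard library instead of by hand: codecs.lookup() accepts the spellings mentioned above and returns the codec's canonical name. A minimal sketch, which takes the encoding name as a parameter (standing in for the value chardet.detect() reports) so that it runs without chardet installed; the helper names and the windows-1250 sample are made up for illustration:

```python
import codecs
import csv
import io


def normalize_encoding(name):
    """Return Python's canonical codec name for an encoding label, e.g. 'UTF-8' -> 'utf-8'."""
    return codecs.lookup(name).name


def decode_rows(raw_bytes, encoding, delimiter=';'):
    """Decode raw CSV bytes with the given encoding and return the data rows as dicts."""
    text = raw_bytes.decode(normalize_encoding(encoding))
    reader = csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample: Polish text encoded as windows-1250, as a detector might report it.
raw = "Col 1;Col 2\nzażółć;gęślą\n".encode('windows-1250')
rows = decode_rows(raw, 'windows-1250')
print(rows[0]['Col 1'] + ' ' + rows[0]['Col 2'])
```

codecs.lookup() raises LookupError for a name Python does not know, which is a convenient place to reject a file early.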

Summary

As you can see, Python is a great language that enables you to solve problems faster and at a lower initial cost. I hope my code helps you in one way or another as much as it helped me while working on one of our projects. Have a nice day!
