Robust CSV file import with DictReader and chardet

Written by
Krzysztof Marciniak
Published on
May 7, 2018
TL;DR

Wondering how to import a CSV file in Python? I've got you covered! Read on and learn how to do it with DictReader and chardet.

Author
Krzysztof Marciniak
Python Developer

Easy as it seems, Python is a language of great opportunities, and mastery of it comes with a lot of practice.

It has a lot of insanely useful libraries, and csv (which provides the DictReader class) is definitely one of them.

This will be an introductory post, so you don't have to worry about your knowledge of Python. If you're looking for tips on how to start learning Python, we've got something for you as well.

Import CSV file in Python: The absolute basics

You might ask: if it's that easy, then why should I even read this?

Well… it's easy, but it might also be a bit confusing because of the number of options available.

Moreover, validating columns and detecting whether the file is a valid CSV file are not built into the csv library, so after the introduction I will describe those as well.

As I mentioned earlier, parsing the file is pretty simple:

[c]
import csv  # import the csv module

# open example.csv as csv_file and iterate over rows
with open('example.csv') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=';')
    for row in reader:
        # print out values of Col 1 and Col 2
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

... and that's all!

Yup, it's that easy. Well, at least the basics. We open the file, read all lines, get columns, profit!

What you might want to do besides that is:

  • check whether that really is a CSV file or not
  • validate that only the columns you want are there
  • check the encoding
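
To experiment with the basic read before tackling the items in this list, here is a minimal Python 3 sketch that needs no file on disk. The sample data and the read_sample helper are made up for illustration, and io.StringIO merely stands in for the open file. It also shows why validating columns matters: asking a row for a column that isn't in the header quietly gives you None.

```python
import csv
import io

# Hypothetical sample data standing in for example.csv
SAMPLE = 'Col 1;Col 2\nfoo;bar\nbaz;qux\n'


def read_sample(text, delimiter=';'):
    """Parse CSV text and return the rows as a list of dicts."""
    # io.StringIO makes the string behave like an open text file
    return list(csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter))


rows = read_sample(SAMPLE)
for row in rows:
    print(row['Col 1'] + ' ' + row['Col 2'])

# A column that is absent from the header yields None from row.get(),
# so concatenating it with a string would raise TypeError.
missing = rows[0].get('Col 3')
```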

Validation

csv has a thing called "Sniffer" that, given a portion of the file, checks whether it is valid or not (besides detecting the dialect, it raises an exception when it can't parse the sample, and that's probably what you're looking for):

[c]
import csv

with open('example.csv') as csv_file:
    try:
        # take a 1024 B (max) portion of the file and try to get the Dialect
        csv.Sniffer().sniff(csv_file.read(1024))
        csv_file.seek(0)  # rewind so the file can be read from the start again
    except csv.Error:
        print('I did not expect the Spanish Inquisition!')
[/c]

If you want to check whether the columns provided are what you expect, it gets a tiny bit trickier. First, provide the fieldnames parameter for the DictReader instance (the names will be used as keys for the dictionary):

[c]
import csv

with open('example.csv') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

Running that example, you'll notice that the first row, containing the header, is printed as well; in general we don't want that to happen.

Let's change this:

[c]
import csv

with open('example.csv') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=['Col 1', 'Col 2'])
    for row in reader:
        if is_first_row:
            is_first_row = False
            continue  # skip the row
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]

Now that we've skipped the header, let's get back to it to check whether it has the column set we want.

Before we do it, I'll define a small helper function:

[c]
from itertools import chain


def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))
[/c]

Don't worry, it's not as complicated as it seems.

We take each element of the list (item (...) for item in nested_list), check whether it's a list and, if it isn't, make a list out of it. Then we join all the lists into a single one, which gives us a nice flattened list (e.g. for [1,2,[3,4]] we'll get [1,2,3,4]).

Now we can improve our import yet again:

[c]
import csv
from itertools import chain


def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))


with open('example.csv') as csv_file:
    is_first_row = True
    valid_columns = ['Col 1', 'Col 2']
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=valid_columns)
    for row in reader:
        if is_first_row:
            current_columns = flatten_list(row.values())
            # ^-- if there are columns we don't want, row.values() will return ['Col 1', 'Col 2', ['Col 3', 'Col 4', (...)]]
            if set(valid_columns) != set(current_columns):
                # ^-- compare sets, because when comparing lists order is important as well
                print('This is not the file I expected! I quit!')
                break
            is_first_row = False
            continue
        print(row.get('Col 1') + ' ' + row.get('Col 2'))
[/c]
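
As an alternative to passing fieldnames and inspecting the first row by hand, DictReader can read the header itself and expose it as reader.fieldnames. The sketch below is written for Python 3 and uses a hypothetical read_rows helper with in-memory sample data; it compares that header against the expected columns. It is a different technique from the flatten_list approach above, shown only for comparison.

```python
import csv
import io

# Hypothetical column names, matching the examples in this post
EXPECTED_COLUMNS = ['Col 1', 'Col 2']


def read_rows(text, expected_columns=EXPECTED_COLUMNS, delimiter=';'):
    """Return the data rows as dicts, or raise ValueError on an unexpected header."""
    reader = csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter)
    # With no fieldnames argument, DictReader takes the keys from the first row.
    # For empty input, fieldnames is None, hence the "or []".
    header = reader.fieldnames or []
    if set(header) != set(expected_columns):
        raise ValueError('unexpected columns: %r' % (header,))
    # The header row is already consumed, so only data rows remain.
    return list(reader)


rows = read_rows('Col 1;Col 2\nfoo;bar\n')
```

Because the header is consumed when fieldnames is first read, there is no need for the is_first_row flag in this variant.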

Encoding detection

As you probably know already, text file content can be represented using different encodings, for example UTF-8, windows-1250, iso-8859-2 etc. In some cases we want to detect that encoding and decode strings so that we can parse them the way we want.

To do that, we'll use the chardet library (it isn't available by default, so you need to use either pip or easy_install to get it).

Doing it is (again) pretty easy: all you need to do is read the file content and then pass it to chardet.detect():

[c]
import chardet

csv_file_raw = csv_file.read()
encoding = chardet.detect(csv_file_raw)['encoding']
if not encoding:
    print('No encoding found for the file! Is it valid?')
elif 'UTF' in encoding.upper():  # compare case-insensitively, whatever casing is reported
    encoding = encoding.replace('-', '').lower()
[/c]

The last two lines (removing '-' if the encoding is UTF-* and then making it lowercase) normalize the name, e.g. UTF-8 becomes utf8, so that it can be compared against 'utf8' when decoding strings with that information. To do that, I'll define the last helper function:

[c]
def string_to_utf8(string, source_encoding):
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")
[/c]

Here we assume the target encoding is utf8 (UTF-8), and if the string isn't in that format, we simply decode it and encode it again using UTF-8.

Now let's put together all the things we discussed here:

[c]
import csv
import chardet
from itertools import chain


def string_to_utf8(string, source_encoding):
    if source_encoding == 'utf8':
        return string
    else:
        return string.decode(source_encoding).encode("utf8")


def string_list_to_utf8(string_list, source_encoding):
    return [string_to_utf8(element, source_encoding) for element in string_list]


def flatten_list(nested_list):
    return list(chain(*[item if isinstance(item, list) else [item] for item in nested_list]))


with open('example.csv') as csv_file:
    csv_file_content = csv_file.read()
    encoding = chardet.detect(csv_file_content)['encoding']
    if not encoding:
        # without an encoding there is nothing left to decode, so stop here
        raise SystemExit('No encoding found for the file! Is it valid?')
    if 'UTF' in encoding.upper():
        encoding = encoding.replace('-', '').lower()
    csv_file.seek(0)  # rewind: read() above consumed the whole file

    is_first_row = True
    valid_columns = ['Col 1', 'Col 2']
    reader = csv.DictReader(csv_file, delimiter=';', fieldnames=valid_columns)
    for row in reader:
        if is_first_row:
            current_columns = string_list_to_utf8(flatten_list(row.values()), encoding)
            if set(valid_columns) != set(current_columns):
                print('This is not the file I expected! I quit!')
                break
            is_first_row = False
            continue
        print(string_to_utf8(row.get('Col 1'), encoding) + ' ' + string_to_utf8(row.get('Col 2'), encoding))
[/c]

As you can see, I added (now really the last) helper function, which changes the encoding of the columns we want to validate to utf8, to rule out the possibility of our code crashing on the string comparison (if the valid columns are unicode strings).

And that's really all there is to importing a CSV file in Python! Isn't that awesome?
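
The decoding helpers above rely on Python 2 byte strings (string.decode). On Python 3 the same idea might look like the sketch below: read the file as bytes, let chardet guess the encoding, decode once, and hand the text to DictReader. The read_csv_bytes helper, its optional encoding argument, and the sample data are assumptions made for illustration; only chardet.detect() and the csv calls come from the libraries themselves.

```python
import csv
import io


def read_csv_bytes(raw, encoding=None, delimiter=';'):
    """Decode raw CSV bytes and return the rows as a list of dicts.

    read_csv_bytes is a hypothetical helper, not part of csv or chardet.
    """
    if encoding is None:
        # chardet is a third-party package; import it only when detection is needed
        import chardet
        encoding = chardet.detect(raw)['encoding']
        if not encoding:
            raise ValueError('No encoding found for the file! Is it valid?')
    text = raw.decode(encoding)
    # newline='' lets the csv module handle line endings itself
    return list(csv.DictReader(io.StringIO(text, newline=''), delimiter=delimiter))


# Hypothetical sample: Polish text encoded as windows-1250
raw = 'Col 1;Col 2\nzażółć;gęślą\n'.encode('windows-1250')
rows = read_csv_bytes(raw, encoding='windows-1250')
```

With a real file you would get the bytes from open('example.csv', 'rb').read() and leave encoding unset so that chardet does the guessing. Passing the encoding explicitly, as in the sample, skips detection when you already know it.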

Summary

As you can see, Python is a great language that enables you to solve problems faster and at a lower initial cost.

I hope my code helps you in one way or another as much as it helped me while working on one of our projects. Have a nice day!
