Let’s consider a Google App Engine project. I’ll talk about google-app-engine-django project with Django in it. At the beginning, let’s create a sample project and a sample app in it:
python manage.py startapp csvup
Add the app to your settings.py:
INSTALLED_APPS = ( # ... csvup, }
and to the root urls.py:
urlpatterns = patterns('', # ... (r'^csvup/', include('csvup.urls')), )
Let’s create the following model in it:
from google.appengine.ext import db from appengine_django.models import BaseModel class Person(BaseModel): name = db.StringProperty() birthdate = db.DateProperty() countriesVisitedCount = db.IntegerProperty() everMarried = db.BooleanProperty()
Now, say we have a legacy DB with 1000 people we’d like to put online into our shiny web app for others to view. You can export it to CSV from your DB, so let’s see how to import that into your application. Note that you can use GAE-supported remote_api, as described here. If you only need a straightforward upload occasionally or for initial datastore population, it’s probably better to go with the GAE-supported way. There are some situations for which you’d like to roll your own solution instead of the remote_api one, such as:
- You’d like to provide non-admin users with a CSV-upload option,
- Different transformations need to be performed before uploading the file,
- CSV upload is just one of the options you’d like to start with and add other formats as you go on (e.g. zipped CSV or YAML).
We’ll start by creating a class:
class Csvup(object): pass
and start adding methods to it for necessary phases.
Parsing CSV
I’m going to stick with properly formatted CSV – dealing with improper CSV is a story on its own. It’s rather easy to set your conventions for this, but I’ll assume the following:
- There’s a header in CSV which matches exactly with property names of our model,
- Dates are in format %Y/%m/%d and
- Booleans are represented as y/n.
Here’s a sample of a valid CSV for the above Person model:
name,birthdate,countriesVisitedCount,everMarried Joe,1990/10/01,3,n Maria,1970/06/05,10,y
First step is to parse the above into Python lists, for which we’ll create the following:
class Csvup(object): def __init__(self, csv): self.csv = csv self.descs = [] def parseCsv(self): pass
Here’s the unit test for the above (using Python’s unittest module):
import unittest import textwrap from csvup.views import Csvup class CsvupTest(unittest.TestCase): def testCsvParsing(self): incsv = textwrap.dedent("""\ name,birthdate,countriesVisitedCount,everMarried Joe,1990/10/01,3,n Maria,1970/06/05,10,y""" ) csvup = Csvup(incsv) descs = csvup.parseCsv() self.assertEquals(len(descs), 2) desc1 = descs[0] vals1 = sorted([d[1] for d in desc1]) self.assertEquals(vals1, ['1990/10/01', '3', 'Joe', 'n'])
The above uses a simple inline CSV document, parses it using our Csvup class and compares the result. An interesting thing is textwrap.dedent method, which allows you to strip the leading spaces from a Python multiline string (curtesy of bignose from here). Put the above in csvup/tests.py and run:
python manage.py test
from the root of the project. It should run all tests – the ones supplied by the frameworks and the one we just created. Instead, you can run:
python manage.py test csvup
to run just our tests. If all is good, we should have a failed test right now, as we don’t have the implementation yet.
Python enables us to parse this rather simply with it’s csv module. Here’s a sample implementation that passes the above test:
import os import csv from csvup.models import Person class Csvup(object): # ... def parseCsv(self): lines = self.csv.split(os.linesep) csvr = csv.reader(lines) header = csvr.next() descs = self.descs for row in csvr: descs.append(zip(header, row)) return descs
Each row in descs is a list of tuples. Each tuple is (property name, value). For simplicity, put this inside your views.py (or change the test above) and run the tests – the only test so far will pass.
Populating model
Now that we got our CSV Pythonized, we need to populate our models. Here’s the additional method in Csvup class and an appropriate test method:
class Csvup(object): def __init__(self, csv): # ... self.models = [] # ... def makeModels(self): self.parseCsv() return self.models class CsvupTest(unittest.TestCase): def setUp(self): incsv = textwrap.dedent("""\ name,birthdate,countriesVisitedCount,everMarried Joe,1990/10/01,3,n Maria,1970/06/05,10,y""" ) self.csvup = Csvup(incsv) def testCsvParsing(self): descs = self.csvup.parseCsv() self.assertEquals(len(descs), 2) desc1 = descs[0] vals1 = sorted([d[1] for d in desc1]) self.assertEquals(vals1, ['1990/10/01', '3', 'Joe', 'n']) def testModelPopulation(self): models = self.csvup.makeModels() self.assertEquals(len(models), 2) model2 = models[1] self.assertEquals(model2.name, 'Maria') self.assertEquals(model2.birthdate, date(1970, 6, 5)) self.assertEquals(model2.countriesVisitedCount, 10) self.assertEquals(model2.everMarried, True)
Note I refactored CsvupTest to do a set up in the setUp method, which gets called before any of the tests is executed. That way we get a fresh Csvup instance each time. Also, Csvup.makeModels is going to call parseCsv in order to guarantee that the CSV is parsed before it attempts to do any model making on its own.
What we are going to do is loop through all the property names supplied and populate the respective property on the model. Before that, the string needs to be converted to the appropriate model type, which depends on the property. To make it simple and consistent:
- For booleans, we’ll use y = True, n = False,
- For dates, we’ll fix the format to %Y/%m/%d. We’ll use datetime module.
For this, we will add the following test:
# ... from datetime import date, datetime class CsvupTest(unittest.TestCase): # ... def testConverters(self): cs = self.csvup.converters self.assertTrue(cs[int](55), 55) self.assertTrue(cs[float](1.23), 1.23) self.assertTrue(cs[bool]('y')) self.assertFalse(cs[bool]('n')) self.assertEqual(cs[date]('1990/10/01'), date(1990, 10, 1)) self.assertEqual(cs[datetime]('1990/10/01 12:34:56'), datetime(1990, 10, 1, 12, 34, 56))
Csvup.converters is a dict that will have the necessary converters. All non-existent are treated as strings. Here’s a sample implementation:
# ... from datetime import date, datetime class Csvup(object): def __init__(self, csv): # ... self.makeConverters() def makeConverters(self): self.converters = { bool: lambda s: s == 'y', int: lambda s: int(s), float: lambda s: float(s), date: lambda s: datetime.strptime(s, '%Y/%m/%d').date(), datetime: lambda s: datetime.strptime(s, '%Y/%m/%d %H:%M:%S'), }
Running the tests will result in two green lights, so we are good so far. Now, what’s left is to make a model loading, i.e. pass the testModelPopulation test. Here’s a sample one:
class Csvup(object): # ... def makeModels(self): self.parseCsv() for desc in self.descs: self.models.append(self.makeModel(desc)) return self.models def makeModel(self, desc): instance = self.mklass() for (name, strval) in desc: prop = self.mklass.properties()[name] converted = strval if prop.data_type in self.converters: converted = self.converters[prop.data_type](strval) instance.__setattr__(name, converted) return instance
Running the tests should give you three green lights.
Progress
So far we have:
- Parsed a CSV,
- Converted all string values to appropriate data types and
- Created models for each row and stored this in a Python list.
In the next post I’ll continue with the implementation of the CSV uploader to GAE.