One of the most frequently talked barriers for getting onto the cloud is the data lock-in. In this post we would try to look at how we can download data as a CSV from the app engine datastore. This would help us get the data back from the app engine, just in case you want to move away from the datastore.
So there exists a tool which can help you get data, that is the good part. The not so good part is that there is nothing available in Java (as of now) to accomplish this. We have to depend on the python sdk. That should give you a hint that we would need to download the python sdk.
It might also be worthwhile to check that you have python installed on your system. When I type python on the command prompt for my Ubuntu 10.04 system, I get
vikas@vikas-laptop:~$ python Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>
So, good to go. Assuming that you have downloaded the python sdk, you would find the appcfg.py file in the root of the distribution.
Now, we have to set up a remote servlet with our Java application with which would be able to communicate with the app engine remotely. The following needs to be set up in your web.xml
<servlet> <servlet-name>RemoteApi</servlet-name> com.google.apphosting.utils.remoteapi.RemoteApiServlet </servlet> <servlet-mapping> <servlet-name>RemoteApi</servlet-name> <url-pattern>/remote_api</url-pattern> </servlet-mapping>
Deploy your application to the app engine.
Now we would connect to the app engine and generate a bulkloader configuration file. A configuration file could either be auto-generated, the easier way or be written by hand. It contains information about all our entities and their fields.
There are 2 ways to generate the configuration file, either by using appcfg.py or bulkloader.py. Bothe the examples are as
appcfg.py create_bulkloader_config –url=http://<your app>.appspot.com/remote_api –application=<your app_id> --filename=config.yml
bulkloader.py --create_config –url=http://<your app>.appspot.com/remote_api –application=<your app_id> --filename=generated_bulkloader.yaml
Once you execute this command, after the necessary authentication the following file is generated
# Autogenerated bulkloader.yaml file. # You must edit this file before using it. TODO: Remove this line when done. # At a minimum address the items marked with TODO: # * Fill in connector and connector_options # * Review the property_map. # - Ensure the 'external_name' matches the name of your CSV column, # XML tag, etc. # - Check that __key__ property is what you want. Its value will become # the key name on import, and on export the value will be the Key # object. If you would like automatic key generation on import and # omitting the key on export, you can remove the entire __key__ # property from the property map. # If you have module(s) with your model classes, add them here. Also # change the kind properties to model_class. python_preamble: - import: base64 - import: re - import: google.appengine.ext.bulkload.transform - import: google.appengine.ext.bulkload.bulkloader_wizard - import: google.appengine.ext.db - import: google.appengine.api.datastore - import: google.appengine.api.users transformers: - kind: Audit connector: # TODO: Choose a connector here: csv, simplexml, etc... connector_options: # TODO: Add connector options here--these are specific to each connector. property_map: - property: __key__ external_name: key export_transform: transform.key_id_or_name_as_string - property: auditActionType external_name: auditActionType # Type: String Stats: 146 properties of this type in this kind. - property: date external_name: date # Type: Date/Time Stats: 146 properties of this type in this kind. import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S') export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')
As you would notice, it would create configurations for all the entities and then leave out some TODOs for you to change. So for example we replace the
connector: # TODO: Choose a connector here: csv, simplexml, etc... connector_options: # TODO: Add connector options here--these are specific to each connector.
connector: csv connector_options: encoding: utf-8 columns: from_header
Once our configuration file is ready, you can start downloading / uploading data to the engine using the configuration file.
To download data, again use any one of the following commands
appcfg.py download_data --config_file=config.yml --filename=UserRole.csv --kind=UserRole –url=http://<your app>.appspot.com/remote_api –application=<your app_id>
bulkloader.py --download --url=http://0-0-4d.latest.bookhours.appspot.com/remote_api --application=bookhours --config_file=generated_bulkloader.yaml --kind=UserRole –filename=UserRole.csv
When the command completes, you would get a csv file for the UserRole which would be something similar to this depending on your entity.
roleName,role,key Consultant,ROLE_CONSULTANT,85004 PM,ROLE_PROJECTMANAGER,88003 Report role,ROLE_REPORT,90002 Administrator,ROLE_ADMIN,126002
There is also a command, which lets you download all the entities all at once
appcfg.py download_data –application= --url=http://.appspot.com/remote_api –filename=data.csv
One of the problems of this usage is that the csv file is not a text file. It seems to be a sqlite file. I have to figure out whether I can open it with sqlite.
Another issue that I found with bulkloader is that it seems to ignore the relationships in the generated configuration file. For example, if there is a relationship between User and UserRole then that is not evident from the auto generated configuration file. Refer to my issue raised on GAE forums.
Nevertheless, we are able to get data back from the app engine, albeit for each individual entity as a csv file. Once we have the csv, it becomes easy for us to put the concerns about data lock-in to rest.