PetalData Python Library

The PetalData Python library makes it easy to interact with your cloud app datasets.

The petaldata package downloads CSV files and schema information from the PetalData server and generates Pandas DataFrames with the proper datatypes. Additionally, the library includes storage functionality for saving datasets locally and to Amazon S3.
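
As a quick sketch of the typical flow (the dataset class and download() call below are illustrative assumptions, not the library's confirmed API):

import petaldata

petaldata.api_key = "[PETALDATA_API_KEY]"  # assumed: authenticate with the PetalData server

# Hypothetical dataset class; substitute the dataset you want to work with.
invoices = petaldata.datasets.stripe.Invoices()
invoices.download()  # fetches the CSV and schema information from the server

df = invoices.df  # a Pandas DataFrame with column datatypes already applied
print(df.dtypes)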

Installation

The package can be installed via pip:

pip install petaldata

You can also find the PetalData package on PyPI and GitHub.


Storage

The Python library can save your dataset to your local computer, Amazon S3, or both. Local storage is the default.

When a dataset is downloaded, it is initially written to a CSV file in local storage. Calling save() on a dataset saves it as a Pickle file, which preserves each column's datatype.
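
For example, a minimal sketch of this save/restore cycle (save() is described above; the dataset class and load() call are assumptions for illustration):

import petaldata

# Hypothetical dataset class, as in the earlier sketch.
invoices = petaldata.datasets.stripe.Invoices()
invoices.download()  # writes the CSV to local storage
invoices.save()      # writes a Pickle file to the configured storage location(s)

# Later, restore the dataset, datatypes intact, without re-downloading the CSV.
# load() is assumed here; check the library's API for the exact call.
invoices = petaldata.datasets.stripe.Invoices()
invoices.load()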


Local Storage

By default, PetalData writes all files to os.getcwd() + "/petaldata_cache/". You can specify a different directory:

petaldata.storage.Local.dir = "/tmp/petaldata_cache/"

If the directory doesn't exist, it will be created.

Disabling Local Storage

Local storage is always used when downloading CSV files, but it can be disabled for storing Pickle files:

petaldata.storage.Local.enabled = False

S3 Storage

Saving your datasets to Amazon S3 is a good option for:

  • Remote scripts - rather than downloading an entire dataset each time a remote script runs, load the previously saved version, upsert() the dataset, and save it again (see the sketch after this list). This can dramatically speed up the script's execution time.
  • Team-wide access - write a script that regularly updates and saves a dataset to S3, then give teams access to that dataset. They won't need to download their own full copies.
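
A minimal sketch of the remote-script pattern (upsert() and save() are the methods mentioned above; the dataset class and load() call are illustrative assumptions):

import petaldata

petaldata.storage.S3.enabled = True
petaldata.storage.S3.bucket_name = "[AWS_BUCKET]"
# AWS credentials as shown under "S3 Configuration" below.

# Hypothetical dataset class; load() restores the previously saved copy from S3.
invoices = petaldata.datasets.stripe.Invoices()
invoices.load()

invoices.upsert()  # fetch only rows added or changed since the last save
invoices.save()    # write the updated Pickle file back to S3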

S3 Configuration

S3 storage must be explicitly enabled:

petaldata.storage.S3.enabled = True

petaldata.storage.S3.aws_access_key_id = "[AWS_ACCESS_KEY_ID]"
petaldata.storage.S3.aws_secret_access_key = "[AWS_SECRET_ACCESS_KEY]"
petaldata.storage.S3.bucket_name = "[AWS_BUCKET]"

Once enabled, S3 acts just like local storage.
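
Rather than hard-coding credentials, you can read them from environment variables (the variable names here are just a common convention):

import os
import petaldata

petaldata.storage.S3.enabled = True
petaldata.storage.S3.aws_access_key_id = os.environ["AWS_ACCESS_KEY_ID"]
petaldata.storage.S3.aws_secret_access_key = os.environ["AWS_SECRET_ACCESS_KEY"]
petaldata.storage.S3.bucket_name = os.environ["AWS_BUCKET"]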