Data Ingestion

The current version of our API allows datasets to be uploaded to our platform. However, we have been using a separate internal process to ingest data from public sources, which lets us transform, aggregate, and otherwise manipulate the data. We are now isolating those processes so that raw data is "adapted" into the final Dataset. Along the way, we store every raw dataset version and keep track of every difference between versions.
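For reference, an upload through the API looks roughly like the sketch below. It uses the generic requests library; the base URL, endpoint path, API key placeholder, and dataset id are hypothetical stand-ins for illustration, not documented routes, so check the API docs for the actual calls.

```python
import requests

API_KEY = "your-api-key"                # hypothetical placeholder
DATASET_ID = 1234                       # hypothetical dataset id
BASE_URL = "https://api.alphacast.io"   # assumed base URL; check the API docs

# Read the raw CSV we want to ingest as a new dataset version.
with open("production.csv", "rb") as f:
    csv_bytes = f.read()

# Hypothetical endpoint: each upload is stored as a new raw version,
# which the platform then "adapts" into the final Dataset.
response = requests.put(
    f"{BASE_URL}/datasets/{DATASET_ID}/data",
    auth=(API_KEY, ""),
    files={"data": ("production.csv", csv_bytes, "text/csv")},
)
response.raise_for_status()
print(response.status_code)
```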

Data Ingestion

This new data ingestion platform also supports data with multiple dimensions. In the past we only accepted a Date dimension and a general Entity dimension, which were shared across all the Variables in a Dataset:

Date        Entity   Production  Imports  Exports
2020-01-01  USA      2123        23000    25000
2020-02-01  Canada   3223        24000    27000
2020-03-01  Mexico   5423        22000    21000
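In pandas terms, the old model corresponds to a frame keyed by just those two dimensions. A minimal sketch, with column names and values taken from the table above:

```python
import pandas as pd

# Old model: Date plus a single Entity dimension shared by all Variables.
old = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "Entity": ["USA", "Canada", "Mexico"],
    "Production": [2123, 3223, 5423],
    "Imports": [23000, 24000, 22000],
    "Exports": [25000, 27000, 21000],
}).set_index(["Date", "Entity"])
```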

Our API now works with "Columns" rather than Variables, and any column can be flagged as an Entity, allowing for composite keys, or in other words, multiple dimensions:

Date        Country  State       Production  Import  Export
2020-01-01  USA      Florida     2123        23000   25000
2020-02-01  USA      California  3223        24000   27000
2020-03-01  USA      Texas       5423        22000   21000
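With columns, any subset of them can act as the entity, so the key becomes composite. A sketch of the same idea in pandas, using a MultiIndex over Date, Country, and State as a stand-in for columns flagged as Entity:

```python
import pandas as pd

# New model: several columns act as Entity and form a composite key.
new = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "Country": ["USA", "USA", "USA"],
    "State": ["Florida", "California", "Texas"],
    "Production": [2123, 3223, 5423],
    "Import": [23000, 24000, 22000],
    "Export": [25000, 27000, 21000],
}).set_index(["Date", "Country", "State"])

# The composite key (Date, Country, State) uniquely identifies each row.
print(new.loc[(pd.Timestamp("2020-03-01"), "USA", "Texas")])
```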

Data Pipelines

Having more detail about the structure of each dataset allows for richer data manipulation scenarios, which we will be enabling as Pipelines:


A Pipeline is triggered whenever a participating Dataset is updated. Individual columns of each dataset are then extracted as defined by the pipeline, the data is transformed, aggregated, or combined, and new Datasets can be created. This may sound complex, but we are working to make it very simple to combine data from multiple datasets on our web portal.
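To give a concrete sense of what a pipeline step does, the sketch below extracts columns from two hypothetical datasets, combines them on their shared entity columns, and derives a new dataset. It uses plain pandas as a stand-in for the portal's pipeline builder; the dataset names and the derived column are illustrative only.

```python
import pandas as pd

def run_pipeline(production: pd.DataFrame, trade: pd.DataFrame) -> pd.DataFrame:
    """Illustrative pipeline step: extract, combine, transform, and output a new dataset."""
    # Extract only the columns this pipeline is configured to use.
    prod = production[["Date", "Country", "State", "Production"]]
    trd = trade[["Date", "Country", "State", "Import", "Export"]]

    # Combine both datasets on their shared entity columns (the composite key).
    merged = prod.merge(trd, on=["Date", "Country", "State"], how="inner")

    # Transform: derive a new variable from the combined columns.
    merged["TradeBalance"] = merged["Export"] - merged["Import"]
    return merged

# A pipeline like this would run whenever one of the participating
# datasets is updated; here it is just an ordinary function call.
```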
