How to merge the content of two datasets?

If you work with data regularly, you have probably needed to join several data sources. If your tool is Excel, you may have solved it with some combination of the VLOOKUP, HLOOKUP, and/or MATCH formulas. Excel is a great solution in many cases, but it can be difficult in some scenarios. For example, when...

  1. ...you have many rows. VLOOKUP can have performance issues and be very slow.
  2. ...you need to match on more than one field to combine the data.
  3. ...the position of rows or columns changes.
  4. ...you only need the data that is in both datasets.
  5. ...some of the data sources change their number of rows and you have to copy or adjust the formulas.

With Alphacast you can use pipelines to combine datasets and keep them connected.

Step 1. Choose a data source

To merge two datasets, first click the Create new button and choose pipeline. Once there, use Fetch dataset to select the required dataset. You can also begin the process by clicking "Transform data" in the dataset you want to merge.

Step 2. Select the data source to "Merge"

Then click Add step below and choose the Merge with Dataset option. There, select the dataset you want to add. Merges work best when both datasets share the same frequency (daily, monthly, quarterly, or yearly).
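
To see why a shared frequency matters, here is a minimal sketch in Python with pandas. It is only an analogy for the Merge step, not Alphacast's own code. The column names (activity_index, exchange_rate) and all values are made up for illustration, and the monthly average is just one possible way to align frequencies.

```python
import pandas as pd

# Hypothetical monthly series (values are made up for illustration).
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "activity_index": [100.0, 101.5],
})

# Hypothetical daily series covering the same two months.
daily = pd.DataFrame({"Date": pd.date_range("2023-01-01", "2023-02-28", freq="D")})
daily["exchange_rate"] = [float(i) for i in range(len(daily))]

# Merging on Date directly: only the days that coincide with a monthly
# date find a match, so most rows end up with an empty activity_index.
mixed = monthly.merge(daily, on="Date", how="outer")
unmatched_rows = int(mixed["activity_index"].isna().sum())

# Aligning the frequency first (monthly average, stamped at month start)
# gives one row per month on both sides and a clean one-to-one merge.
daily_as_monthly = (
    daily.set_index("Date")["exchange_rate"].resample("MS").mean().reset_index()
)
aligned = monthly.merge(daily_as_monthly, on="Date", how="inner")

print(len(mixed), unmatched_rows)  # 59 57
print(len(aligned))                # 2
```

In the mixed-frequency merge, 57 of the 59 rows have no monthly value, while the aligned merge has exactly one complete row per month.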

Step 3. Choose the common fields

With two datasets, we have to tell the system how to "splice" them together. That is, which fields must be present in both datasets so that their rows can be matched.

  • Usually there will be only one Date field, in which case the datasets will be merged by their dates.
  • In addition to the date, datasets can contain more than one entity. For example, they can have data by date and by country. In this case, you need to identify which field of the second dataset, if any, corresponds to the country field.
  • If no country field is selected for the second dataset, the connection will be made through the date field only. In this case, the rows of dataset B may appear duplicated if their date occurs more than once in dataset A.
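
The duplication described in the last point can be reproduced with a short pandas sketch. This is an analogy under stated assumptions, not the pipeline's implementation: the column names (country, Entity, activity, prices) and the values are hypothetical.

```python
import pandas as pd

# Dataset A: one row per date and country (made-up values).
a = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-02-01", "2023-02-01"]),
    "country": ["Argentina", "Brazil", "Argentina", "Brazil"],
    "activity": [100.0, 200.0, 101.0, 202.0],
})

# Dataset B: same structure, but its entity column has a different name.
b = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-02-01", "2023-02-01"]),
    "Entity": ["Argentina", "Brazil", "Argentina", "Brazil"],
    "prices": [10.0, 20.0, 11.0, 22.0],
})

# Matching on date AND country: each row of A meets exactly one row of B.
by_date_and_country = a.merge(
    b, left_on=["Date", "country"], right_on=["Date", "Entity"], how="left"
)

# Matching on date only: each row of A meets every row of B with that date,
# so the rows of B are repeated once per occurrence of the date in A.
by_date_only = a.merge(b, on="Date", how="left")

print(len(by_date_and_country))  # 4
print(len(by_date_only))         # 8
```

With both keys the result keeps the four original rows. With the date alone it doubles to eight, pairing Argentina's activity with Brazil's prices and vice versa.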

image.png

In this example, we used two datasets with a monthly frequency and the same entity (Argentina). With the Left Join option, all the data from the first dataset (EMAE) is kept, and only the Consumer Price Index data that matches it in date and entity is added.

Step 4. Choose the Matching type

There are four join types:


  • Inner join: The new dataset will have only the rows that can be matched in both datasets.
  • Left join: All the rows of dataset A are kept, and the unmatched rows of dataset B are discarded.
  • Right join: The reverse of the previous one. All the rows of dataset B are kept, and the unmatched rows of dataset A are discarded.
  • Outer join: All the rows of both datasets are kept, even if they do not match.
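
The four join types behave like the `how` options of a pandas merge. The sketch below is an analogy, not Alphacast's own engine. It uses two small hypothetical datasets (the column names a_value and b_value and all values are made up) that overlap on two of their dates.

```python
import pandas as pd

# Dataset A covers January to March; dataset B covers February to April.
a = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
    "a_value": [1.0, 2.0, 3.0],
})
b = pd.DataFrame({
    "Date": pd.to_datetime(["2023-02-01", "2023-03-01", "2023-04-01"]),
    "b_value": [20.0, 30.0, 40.0],
})

# Run the same merge once per join type and compare the row counts.
results = {
    how: a.merge(b, on="Date", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, merged in results.items():
    print(how, len(merged))
# inner 2  (February and March only)
# left 3   (all of A; January has no b_value)
# right 3  (all of B; April has no a_value)
# outer 4  (January through April)
```

Rows kept without a match on the other side get empty values in the columns that come from the other dataset.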

Step 5. Publish

The result of the previous step is a dataset that combines the columns of Dataset A and Dataset B. From here, you can continue processing it or publish it as a new dataset.
