When analyzing data, it often happens that you need to change the format of your table. It doesn't matter if it's your bank account transactions or your child’s school grades, when you look at and analyze data you often think, "If only I had the data in this format, things would be much easier."
But things can be easier if you are clear about what you want to achieve and if you use the right tools.
What table format my data is in?
Data can come in different formats. Considering the bank account example again, you can have one row for each transaction.
Or you can have a more compact view, with the balance day by day for each type of transaction.
Passing from the first to the second table, we’ve lost some information, since we have summed all the transaction of the same type happened in the same date.
This transformation is usually referred to as an aggregation: data are aggregated according to one or more attributes (in this case date and type) using a specific aggregation operator (in this case the sum).
Another possible view consists in having different columns for each type of transaction.
Passing from Table 2 to Table 3 does not cause loss of information, it’s just a different way to look at the same data.
Usually, data are saved in databases in a format that resembles Table 1. In fact, this format allows you to have all the information necessary for any processing.
Introducing wide and long formats
Comparing Table 2 with Table 3, you can see that Table 2 has a more suitable format for saving. This format is sometimes called sparse because it contains only the information that is strictly necessary. Table 3 contains many cells with zero, which from the point of view of information content, are not particularly relevant and therefore only cause the increase in memory necessary to save the table.
Moreover, the number (and names) of columns in Table 2 is fixed, while in Table 3 number and names of columns can change if new types of transactions are introduced.
Another adjective that we can use to name Table 2 is long: actually, compared to Table 3 it contains more rows. In the same way, Table 3 could be considered wide. Despite being not very suitable for storing information, Table 3 could be very useful when you need to have particular views on your data.
But how can we pass from a long table to a wide one and vice versa? This operation is usually referred to as pivoting.
From long to wide data format
Since data are usually stored in a long format, let’s start from analyzing how we can transform a long table to a wide one. In Table 2 we can identify these sets of attributes:
- Attributes that we want to use as keys in the final table. In our case, it is the Date attribute: actually, in the final table, each row corresponds to a different date.
- Attributes that will be used to create the new columns. In our case, it is Type attribute: we want to create a different column for each value of the Type attribute.
- Attributes that contain the values we want to change the format of. In our case, it is the Value column since it contains the values we are interested in.
Reshaping a table in Rulex is straightforward. You just need to connect a Reshape To Wide task to your data and to select the right parameters: key attributes, long attributes and widened attributes, corresponding to the categories we’ve defined before.
The output table will be like Table 3.
From wide to long data format
But, if usually data come in a long format, why should we need also to convert from wide to long? When could the long format be convenient?
Well, there could be several reasons, but the main one is: because long format is much more efficient if you need to make operations on all the columns.
Suppose that you want to convert the values above e.g. from euros to dollars. Starting from Table 3, you should define different formulas:
$”Bank Transfer” = $”Bank Transfer” * Conversion_Rate $”Online Purchase” = $”Online Purchase” * Conversion_Rate…
As you can see, these formulas depend on the specific transaction type, if you add a new type, you need also to add a new formula, which could be not very convenient with a view to automating analysis flows.
Starting from Table 2, instead, you would need a single formula that does not depend on the transaction type.
$”Value” = $”Value” * Conversion_Rate
If a new transaction type pops up, you don’t need to worry: the same formula will work also for the new type.
Again, reshaping tables to a long format is straightforward in Rulex.
You just need to connect a Reshape To Long task after your wide table and select the columns that you want to transform into a one (or more than one) column.
Remember that, instead of dragging and dropping the list of attributes you can also define criteria such as all the attributes whose name starts with ”Value“.
Combining Reshape To Wide and Reshape To Long you can basically convert your data to the format that is most suitable for your analysis. You just need to have clear what you need to be sure that your data are always processed in the right way.
More about data management
If you want to learn more about data management in Rulex Platform, especially how easy it is to deal with different data formats and databases, check out the article "Smarten up your everyday data management".
Edited by Enrico Ferrari