
Rulex Platform vs Pandas: Performance Comparison


Matteo Aragone


One of the qualities that customers appreciate the most about Rulex Platform is its exceptional performance levels, and its ability to effectively process data, even big data, much faster than any of its competitors.

This makes Rulex Platform the optimal choice for executing multiple operations at a rapid pace, from gathering incoming data with diverse data types, to applying machine learning algorithms, and handling complex data engineering projects.

Comparing data performance

Our data science team, in particular @Walter Rossi, conducted a series of tests comparing the performance of Rulex Platform with that of Pandas, the popular Python-based data manipulation tool. The results were overwhelmingly in favor of Rulex technology:

  • Rulex Platform outperformed Pandas in 25 out of 30 tests.
  • Rulex Platform demonstrated better memory management in 28 out of 30 tests.

The difference in data processing speed between the two tools was especially noticeable with the largest dataset, consisting of 50 million rows. In one such test, Pandas took more than 30 minutes (1823 seconds) to process the data, while Rulex Platform accomplished the same task in just 26 seconds!

Below are the full details of the comparative tests.

Testing methodology

We conducted our tests on three datasets derived from the RDW Gekentekende_voertuigen open dataset: https://opendata.rdw.nl/Voertuigen/Open-Data-RDW-Gekentekende_voertuigen/m9d7-ebf2

  • The first dataset includes the first 5 million rows.
  • The second dataset contains all 15 million rows.
  • The third dataset concatenates copies of the Gekentekende_voertuigen dataset to reach a total of 50 million rows.
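
The three datasets described above can be derived from a single source table. The sketch below shows one way to do that in Pandas. The function name `build_test_datasets` and the tiny stand-in table are hypothetical, and repeating the source and trimming it to the target size is an assumption about how the concatenation was done, not the method used in the tests.

```python
import pandas as pd


def build_test_datasets(full: pd.DataFrame, head_rows: int, target_rows: int):
    """Return (head subset, full dataset, enlarged dataset)."""
    # First dataset: the first `head_rows` rows of the source.
    small = full.head(head_rows)

    # Third dataset: repeat the source until it has at least `target_rows`
    # rows, then trim it to exactly `target_rows` (assumed approach).
    copies = -(-target_rows // len(full))  # ceiling division
    large = pd.concat([full] * copies, ignore_index=True).head(target_rows)
    return small, full, large


# Tiny stand-in for the RDW data, so the sketch runs without the download.
demo = pd.DataFrame(
    {"Merk": ["FIAT", "OPEL", "TOYOTA"], "Bruto BPM": [100, 600, 900]}
)
small, medium, large = build_test_datasets(demo, head_rows=1, target_rows=10)
print(len(small), len(medium), len(large))
```

With the real data, `head_rows` would be 5,000,000 and `target_rows` 50,000,000.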

The complete list of operations performed is as follows:

| ID | Operation | Description |
|----|-----------|-------------|
| 1 | ["Voertuigsoort"] = ["Cilinderinhoud"]/["Massa ledig voertuig"] | Ratio of two numerical columns |
| 2 | ["Concat"] = ["Merk"] + ["Eerste kleur"] | Concatenation of two text columns |
| 3 | Sort by "Bruto BPM", ascending | Sort (by numerical column) |
| 4 | Sort by "Merk", ascending | Sort (by text column) |
| 5 | Group by "Datum tenaamstelling", "Typegoedkeuringsnummer" and mean on "Bruto BPM" | Group by text column + mean on numerical column |
| 6 | Filter on "Bruto BPM" > 500 | Filter on numerical column |
| 7 | Filter on "Merk" = "FIAT" or "TOYOTA" or "OPEL" | Filter on text column |
| 8 | Sequence of operations: 1,2,3,4,6,7 | Sequence of operations |
| 9 | Join on "Merk" with a dictionary | Join (equality condition) |
| 10 | Sequence of all previous operations: 1,2,5,3,4,6,7,9 (after the group operation the dataset's dimension is reduced) | Sequence of all previous operations |

Table 1
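
For readers who want to reproduce the Pandas side, the single operations in the table could be written roughly as follows. This is a minimal sketch on a small synthetic sample, not the code used in the tests. The sample values, the `lookup` table with its `Land` column, and the choice of a left join for operation 9 are all assumptions made for illustration.

```python
import pandas as pd

# Small synthetic sample using the column names from the table (values invented).
df = pd.DataFrame({
    "Merk": ["FIAT", "OPEL", "BMW", "TOYOTA"],
    "Eerste kleur": ["ROOD", "BLAUW", "ZWART", "WIT"],
    "Cilinderinhoud": [1200.0, 1600.0, 2000.0, 1800.0],
    "Massa ledig voertuig": [1000.0, 1250.0, 1600.0, 1200.0],
    "Bruto BPM": [400.0, 900.0, 1500.0, 300.0],
    "Datum tenaamstelling": ["20200101", "20200101", "20210315", "20210315"],
    "Typegoedkeuringsnummer": ["e1", "e1", "e2", "e2"],
})

# 1. Ratio of two numerical columns.
df["Voertuigsoort"] = df["Cilinderinhoud"] / df["Massa ledig voertuig"]
# 2. Concatenation of two text columns.
df["Concat"] = df["Merk"] + df["Eerste kleur"]
# 3 and 4. Sort by a numerical column, then by a text column.
by_bpm = df.sort_values("Bruto BPM", ascending=True)
by_merk = df.sort_values("Merk", ascending=True)
# 5. Group by two columns and take the mean of a numerical column.
grouped = df.groupby(
    ["Datum tenaamstelling", "Typegoedkeuringsnummer"], as_index=False
)["Bruto BPM"].mean()
# 6. Filter on a numerical column.
high_bpm = df[df["Bruto BPM"] > 500]
# 7. Filter on a text column.
brands = df[df["Merk"].isin(["FIAT", "TOYOTA", "OPEL"])]
# 9. Join on "Merk" with a lookup table (hypothetical "Land" column;
#    the left join is an assumption, the article only says "equality condition").
lookup = pd.DataFrame({"Merk": ["FIAT", "OPEL", "TOYOTA"], "Land": ["IT", "DE", "JP"]})
joined = df.merge(lookup, on="Merk", how="left")
```

Operations 8 and 10 chain these steps in the order given in the table.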

The characteristics of the machine used for testing are as follows:

  • CPU: Intel Core i7-1255U (12th gen), 1.70 GHz
  • RAM: 32 GB
  • OS: Windows 11
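
The article does not say how execution time and peak memory were recorded. As a rough illustration only, a Python operation can be timed and its peak allocation tracked with the standard library. The helper name `measure` is hypothetical, and `tracemalloc` is an assumed choice: it only sees allocations made inside the Python process and adds overhead of its own, so it would not be suitable for measuring an external tool.

```python
import time
import tracemalloc


def measure(operation, *args, **kwargs):
    """Run `operation` once; return (result, elapsed seconds, peak bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = operation(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Example with a stand-in workload; replace it with a DataFrame operation.
result, seconds, peak_bytes = measure(sorted, list(range(100_000, 0, -1)))
print(f"{seconds:.3f} s, peak {peak_bytes / 1e6:.1f} MB")
```

Running each operation several times and reporting the median would make the timings less sensitive to background activity on the machine.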

Performance results

Tests on dataset with 5 million rows of data

Our tests showed that in 7 out of 10 cases, Rulex Platform outperformed Pandas in terms of speed.

In some scenarios, the performance of the two tools was similar, but in others, Rulex Platform demonstrated significantly faster processing times.

For instance, in one particular sequence of operations, Rulex Platform took only 2 seconds to complete, while Pandas took 22 seconds. In a different sequence of operations, Rulex Platform took only 9 seconds to complete, compared to 47 seconds for Pandas.

Additionally, Rulex Platform showed superior memory management compared to Pandas, with peak memory usage being approximately four times lower, as Pandas used around 8GB of memory while Rulex Platform used only about 2GB.

Tests on dataset with 15 million rows of data

The test on 15 million rows showed that Rulex Platform outperformed Pandas in 8 out of 10 tests in terms of speed.

The results demonstrated that Rulex Platform consistently outpaced Pandas, with the greatest difference observed in the sequences of operations (operations 8 and 10 in Table 1). For example, in one sequence, Rulex Platform took only 8 seconds compared to Pandas' 119 seconds, while in the other, Rulex Platform took only 28 seconds compared to Pandas' 185 seconds.

Moreover, Rulex Platform once again proved to be more memory-efficient than Pandas, with the most significant difference recorded in the sorting operation. While Pandas used up to 25GB of memory in a test, Rulex Platform utilized only about 5GB, highlighting its superior memory management capabilities.

Tests on dataset with 50 million rows of data

In these tests, Rulex Platform outperformed Pandas in terms of speed in every case, and the difference in performance was more striking than on the smaller datasets.

Once again, the greatest difference in performance levels was observed in the sequence of operations. Pandas took 1823 seconds, which is more than 30 minutes, while Rulex Platform completed the same operations in just 26 seconds.
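
To put the quoted timings on a common scale, the short sketch below computes the Pandas-to-Rulex time ratio for each sequence-of-operations figure mentioned in this article. It only uses the numbers given in the text. The dictionary layout and variable names are illustrative.

```python
# (Rulex seconds, Pandas seconds) for the sequence-of-operations timings
# quoted in the text, keyed by dataset size in millions of rows.
timings = {
    5: [(2, 22), (9, 47)],
    15: [(8, 119), (28, 185)],
    50: [(26, 1823)],
}

# Ratio of Pandas time to Rulex time, rounded to one decimal place.
speedups = {
    rows: [round(pandas_s / rulex_s, 1) for rulex_s, pandas_s in pairs]
    for rows, pairs in timings.items()
}
for rows, factors in speedups.items():
    print(f"{rows}M rows: Pandas/Rulex time ratio {factors}")
```

On the 50 million row dataset the ratio is about 70, compared with roughly 5 to 15 on the two smaller datasets.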

Moreover, Rulex Platform demonstrated superior memory management compared to Pandas, with lower memory usage peaks in 8 out of 10 tests.

It is worth noting that Pandas' peak memory usage approached the capacity of the machine (approximately 25GB out of 32GB). In such cases, Pandas slowed down significantly, while Rulex Platform continued to perform optimally.

More about processing data in Rulex Platform

If you are interested in learning more about Rulex Platform's data performance, read the article: Superior Data Performance: Rulex Outperforms Pandas.

Feel the speed of Rulex Platform

Want to try Rulex Platform straight away? Get a 30-day free trial.
