Data analysis and visualization with Python, SQLAlchemy, and Bokeh
When was the first electric car registered? (Spoiler: it was much earlier than most people may think.) Which cars are more expensive, the electric Porsche or Jaguar? Exploratory data analysis (EDA) is not only an important part of building every data pipeline, but also a pretty interesting process. In the first part, I analyzed the RDW (Netherlands Vehicle Authority) dataset with Python and Pandas, and one of the challenges was the large (about 10 GB) dataset size. As a workaround, I specified the list of columns to be loaded into Pandas. It works, but what if the dataset is larger and there is still not enough RAM to fit all the data in memory, or if the dataset is located in a remote database? In this article, I will show how to perform a similar analysis using SQLAlchemy. This allows doing the "heavy" data processing in SQL, without the need to load all the data into Pandas.
Let's get started.
Loading the data
The RDW ("Rijks Dienst Wegverkeer") is a Dutch organization that handles approvals and registration of motorized vehicles and driving licenses in the Netherlands. I will use the "Gekentekende voertuigen" ("Vehicles with license plates") dataset. As described in the first part, it is available under a Public Domain license and can be downloaded from opendata.rdw.nl. For data processing, I will be using SQLite, a free and lightweight database engine that can easily run on any PC.
First, we need to download and import the CSV file into SQLite. The file size is about 10 GB; it can be downloaded for free, and no registration is required. To import the data, I run the "sqlite3 rdw_data.db" command and enter three commands:
sqlite> .mode csv
sqlite> .import Open_Data_RDW__Gekentekende_voertuigen.csv rdw_data
sqlite> .quit
Here "Open_Data_RDW__Gekentekende_voertuigen.csv" is the original CSV file, and "rdw_data" is the table to be created. The import process takes some time, and after that, we are done with the command line and can return to Jupyter Lab. First, let's make the needed imports and see which database columns we have:
import pandas as pd
from sqlalchemy import create_engine, MetaData, table, column, select, func, text, desc
from sqlalchemy import inspect
from sqlalchemy.orm import Session
rdw_db = create_engine('sqlite:///rdw_data.db')
table_name = 'rdw_data'
with Session(rdw_db) as session:
    insp = inspect(rdw_db)
    columns = insp.get_columns(table_name)
    display(pd.DataFrame(columns))
I am using a Pandas DataFrame to display the results because its output is much more convenient to read. For example, "display(columns)" will show output like this:
At the same time, the "display(pd.DataFrame(columns))" output looks much better:
Let's check what we get. We can see that all columns have TEXT type, so we will need to convert the values. There are 91 columns in the database, but for the actual analysis, I only need the car's type, license plate, model name, price, and registration date. I will also use the "Number of Cylinders" as a helper to detect whether a car is electric or not. Last but not least, I am going to analyze only "personal" ("Personenauto" in Dutch) cars, not trucks or buses, so I will use this filter in the SQL query.
Let's do this conversion using SQL:
with Session(rdw_db) as session:
    session.execute(text('DROP TABLE IF EXISTS rdw_cars'))
    session.execute(text('CREATE TABLE rdw_cars("index" INTEGER PRIMARY KEY AUTOINCREMENT, '
                         '"Model" TEXT, '
                         '"Trade name" TEXT, '
                         '"License Plate" TEXT, '
                         '"Number of Cylinders" INTEGER, '
                         '"Catalog price" INTEGER, '
                         '"First registration NL" TEXT, '
                         '"Is electric" INTEGER DEFAULT 0)'))

    session.execute(text('BEGIN TRANSACTION'))
    session.execute(text('INSERT INTO rdw_cars("Model", "Trade name", "License Plate", "Number of Cylinders", "Catalog price", "First registration NL") '
                         'SELECT '
                         '"Merk", '
                         '"Handelsbenaming", '
                         '"Kenteken", '
                         '(CASE WHEN LENGTH("Aantal cilinders") > 0 THEN CAST("Aantal cilinders" as INTEGER) ELSE NULL END), '
                         '(CASE WHEN LENGTH("Catalogusprijs") > 0 THEN CAST("Catalogusprijs" as INTEGER) ELSE NULL END), '
                         'DATE(SUBSTR("Datum eerste tenaamstelling in Nederland", 1, 4) || "-" || SUBSTR("Datum eerste tenaamstelling in Nederland", 5, 2) || "-" || SUBSTR("Datum eerste tenaamstelling in Nederland", 7, 2)) '
                         'FROM rdw_data WHERE "Voertuigsoort" = "Personenauto"'))
    session.execute(text('COMMIT'))
Here I created a new table and converted the Integer and Date columns into proper formats. I replaced all empty strings with NULLs, and as a bonus for readers, I translated the Dutch column names to English. I also created the "Is electric" column, which I will use later.
The preliminary conversion is done, and we are ready to go.
Basic analysis
First, let's look at the main properties of the dataset, like data samples, dimensionality, and the number of NULL values.
Using SQL, we can get the total number of records:
with Session(rdw_db) as session:
    q = session.execute(text('SELECT COUNT(*) FROM rdw_cars')).scalar()
    print("Cars total:", q)
In total, we have 9,487,265 cars registered in the Netherlands at the time of writing this article (for readers who download the dataset later, this number will obviously be higher). The total is equal to the number I got in the first part, where I did a similar analysis using Pandas only — an easy way to check that the processing is correct.
Let's now see the first 5 samples in the database; it is easy to do using SQL. Here and later, I will be using Pandas to display tables; it is convenient because Pandas has native SQL bindings:
with Session(rdw_db) as session:
    df = pd.read_sql_query(text("SELECT * FROM rdw_cars LIMIT 5"), con=session.connection(),
                           dtype={'Catalog price': pd.UInt32Dtype(),
                                  'Number of Cylinders': pd.UInt32Dtype()})
    display(df.style.hide(axis="index"))
The result looks like this:
Let's check how many missing/NULL values we have in different columns. A Pandas DataFrame has a convenient method, "df.isna().sum()", but I was not able to find something similar in SQL. We need to specify all the columns we want to check:
with Session(rdw_db) as session:
    request = ('SELECT '
               'SUM(CASE WHEN "Model" IS NULL OR "Model" = "" THEN 1 ELSE 0 END) AS model_no_data, '
               'SUM(CASE WHEN "Trade name" = "" THEN 1 ELSE 0 END) AS trade_name_empty, '
               'SUM(CASE WHEN "Trade name" IS NULL THEN 1 ELSE 0 END) AS trade_name_nulls, '
               'SUM(CASE WHEN "License Plate" IS NULL OR "License Plate" = "" THEN 1 ELSE 0 END) AS lp_no_data, '
               'SUM(CASE WHEN "Number of Cylinders" = 0 THEN 1 ELSE 0 END) AS num_cylinders_zeros, '
               'SUM(CASE WHEN "Number of Cylinders" IS NULL THEN 1 ELSE 0 END) AS num_cylinders_nulls, '
               'SUM(CASE WHEN "Catalog price" = 0 THEN 1 ELSE 0 END) AS price_zeros, '
               'SUM(CASE WHEN "Catalog price" IS NULL THEN 1 ELSE 0 END) AS price_nulls, '
               'SUM(CASE WHEN "First registration NL" IS NULL THEN 1 ELSE 0 END) AS registration_nulls, '
               'COUNT(*) AS total '
               'FROM rdw_cars')

    df = pd.read_sql(text(request), con=session.connection())
    display(df.style.hide(axis="index"))
Using SQL, I calculate the number of values that are NULL or empty. The result looks like this:
Here we can see the total number of cars (9,487,265). Every car has a license plate and a registration date; these fields are probably mandatory for registration. But 2,480,506 records have no price, 864 have no "trade name", and so on. And here I can see a problem — these 864 records with empty "trade name" fields do not match the 1,405 empty records I got with Pandas in the first part. That is clearly not right; where is the difference? It is impossible to check 9,487,265 records manually, and the easiest way to debug the problem was to save the unique "trade name" values to a text file and compare both files using a "diff" tool. It turned out that the problem was simple but interesting — in the first part, I was using the "pd.read_csv" method to load the data. This method is "smart enough" to automatically replace "NULL", "NA", "N/A" and some other values (a full list can be found in the manual) with NULLs, and this conversion is enabled by default. In our case, the Mazda NA is a real car model, and Pandas automatically converted the name "NA" to NULL for those cars (it also reminds me of the old story about the man with the last name Null, who was "invisible to computers"). Anyway, the Mazda NA is not electric, so it does not affect the results of the first part, but it is good to keep in mind that such issues may occur.
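The pitfall is easy to reproduce with a tiny in-memory CSV; passing keep_default_na=False (or a custom na_values list) disables the automatic conversion. The column names below are just a stand-in for the real dataset:

```python
import io
import pandas as pd

# A tiny CSV reproducing the "Mazda NA" pitfall
csv_data = 'Model,Trade name\nMAZDA,NA\nTESLA,MODEL 3\n'

# Default behavior: the literal string "NA" becomes NaN
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default["Trade name"].isna().sum())  # 1

# keep_default_na=False keeps "NA" as a plain string
df_raw = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)
print(df_raw["Trade name"].isna().sum())  # 0
print(df_raw["Trade name"][0])            # NA
```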
But let's get back to the analysis. Using SQL, we can easily make useful requests; for example, let's see the top 10 most expensive cars in the Netherlands:
with Session(rdw_db) as session:
    df = pd.read_sql(text('SELECT "Model", "Trade name", "Catalog price", "First registration NL" FROM rdw_cars ORDER BY "Catalog price" DESC LIMIT 10'), con=session.connection())
    display(df)
The result is interesting:
I was expecting to see Porsche, Mercedes, or BMW in this list, but having Peugeot or Fiat here is surprising to me — though I am not an expert in luxury cars.
Data transform
We already did some basic analysis using SQL requests, but the purpose of this article is to analyze electric cars. To detect whether a car is electric or not, we need to know its manufacturer and model name. Ideally, if electric cars had "ELECTRIC" in their names, the task would be easy. But in real life, there is no logic in model naming at all. The "Mazda MX-30" is electric, but the "Mazda MX-5" is not. The "Kia Niro" is electric and the "Kia Sorento" is not, and so on. There is no rule for this, and the easiest way is just to create a table of electric car models and use it. But first, let's check whether the car models and trade names in the dataset are consistent.
Let's verify car models first; for example, let's see all PEUGEOT cars:
with Session(rdw_db) as session:
    df = pd.read_sql_query(text('SELECT "Model", COUNT(*) AS Count FROM rdw_cars WHERE "Model" LIKE "%PEUGEOT%" GROUP BY "Model"'), con=session.connection())
    display(df.style.hide(axis="index"))
The result looks like this:
Almost all cars in the database have the name "PEUGEOT", which is fine, but several cars have longer names, like "PEUGEOT BOXER". The first word is enough to identify the car model, so we can just remove the rest. This will also be more convenient for future analysis; for example, we can group cars by model and see how many Peugeot cars were sold. In the first part, I already created a method to remove redundant words from the model name:
def model_normalize(s_val):
    """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
    if s_val and isinstance(s_val, str) and len(s_val) > 0:
        return s_val.replace("-", " ").replace("/", " ").split()[0].upper().strip()
    return None
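As a quick sanity check, the helper can be exercised on its own (the function body is repeated here so the snippet runs standalone):

```python
def model_normalize(s_val):
    """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
    if s_val and isinstance(s_val, str) and len(s_val) > 0:
        return s_val.replace("-", " ").replace("/", " ").split()[0].upper().strip()
    return None

# Dashes and slashes become separators; only the first word is kept
print(model_normalize("PEUGEOT BOXER/GLOBE-TRAVE "))  # PEUGEOT
print(model_normalize("MERCEDES-BENZ"))               # MERCEDES
print(model_normalize(None))                          # None
```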
Let's now check the next column. Car trade names in the dataset can sometimes duplicate the model, like the "NISSAN" car in this example:
I created a method to remove these duplicates; in this example, it will convert the "NISSAN MURANO" field to just "MURANO".
def name_normalize(model: str, trade_name: str):
    """ Remove duplicates and convert the name to upper case """
    if isinstance(trade_name, str) and len(trade_name) > 0:
        name = trade_name.upper().strip()
        # Remove duplicates from model and trade name:
        # ("TESLA", "TESLA MODEL 3") => ("TESLA", "MODEL 3")
        if name.split()[0] == model:
            # "TESLA MODEL 3" => [TESLA, MODEL, 3] => "MODEL 3"
            return ' '.join(name.split()[1:])
        return name
    return None
Now we can finally determine whether a car is electric or not. In the first part, I already created a method for that:
electric_cars = {
    "AIWAYS": ['U5', 'U6'],
    "AUDI": ['E-TRON'],
    "BMW": ['I3', 'I4', 'I7', 'IX'],
    "CITROEN": ['E-C4'],
    "FIAT": ['500E', 'ELETTRA'],
    "FORD": ['MACH-E'],
    "HONDA": ['"E"', '"E ADVANCE"'],
    "HYUNDAI": ['IONIQ', 'KONA'],
    "JAGUAR": ['I-PACE'],
    "KIA": ['NIRO', 'E-SOUL'],
    "LEXUS": ['RZ'],
    "LUCID": ['AIR'],
    "MAZDA": ['MX-30'],
    "MERCEDES": ['EQA', 'EQB', 'EQC', 'EQS', 'EQV'],
    "MG": ['ZS EV'],
    "MINI": ['COOPER SE'],
    "NISSAN": ['ALTRA', 'ARIYA', 'EVALIA', 'LEAF', 'NUVU'],
    "OPEL": ['AMPERA-E', 'COMBO-E', 'CORSA-E', 'MOKKA-E', 'VIVARO-E', 'ZAFIRA-E'],
    "PEUGEOT": ['E-208', 'E-2008', 'E-RIFTER', 'E-TRAVELLER'],
    "POLESTAR": ['2', '3'],
    "PORSCHE": ['TAYCAN'],
    "RENAULT": ['MASTER', 'TWINGO', 'KANGOO ELEC', 'ZOE'],
    "SKODA": ['ENYAQ'],
    "SMART": ['EQ'],
    "TESLA": [''],
    "TOYOTA": ['BZ'],
    "VOLKSWAGEN": ['ID.3', 'ID.4', 'ID.5', 'E-GOLF'],
    "VOLVO": ['C40', 'XC40']
}
def check_is_electric(model: str, trade_name: str, cylinders: int):
    """ Determine if the car is electric """
    if isinstance(cylinders, int) and cylinders > 0:
        return False
    for e_model, e_names in electric_cars.items():
        if model == e_model:
            for e_name in e_names:
                if trade_name and (e_name in trade_name or e_name.replace('"', '') == trade_name):
                    return True
                if trade_name is None and len(e_name) == 0:
                    return True
    return False
In this code, I search for specific keywords; for example, if the model is "BMW", then the "I3" trade name tells us that the car is electric. As an extra check (some cars may be electric or hybrid), I also analyze the number of cylinders, which must be 0 or NULL for electric cars.
All three methods were already tested in the first part, and they work well; for example, I can easily apply the model_normalize method to a Pandas dataset using one line of code:
df["Model"] = df['Model'].map(lambda s: model_normalize(s))
But how can we use it with SQL? Well, directly we can't, but we can do it with the help of an ORM.
SQLAlchemy ORM
ORM (Object Relational Mapping) is a technology for making a "bridge" between an OOP language and relational databases. Practically, we can create a special Python class, and SQLAlchemy will automatically transform all requests to this class into SQL. This is pretty convenient and allows developers to write code in pure Python, without ugly brackets and long SQL strings.
Let's create a "Car" class and put the needed methods there:
from sqlalchemy.orm import DeclarativeBase
from sqlalchemy import String, Integer, Date, Column

class Base(DeclarativeBase):
    pass

class Car(Base):
    __tablename__ = "rdw_cars"

    index = Column("index", Integer, primary_key=True)
    license_plate = Column("License Plate", String)
    model = Column("Model", String)
    trade_name = Column("Trade name", String)
    num_cylinders = Column("Number of Cylinders", Integer)
    first_registration = Column("First registration NL", Date)
    price = Column("Catalog price", Integer)
    is_electric = Column("Is electric", Integer)

    def model_normalize(self):
        """ "PEUGEOT BOXER/GLOBE-TRAVE " => "PEUGEOT" """
        if self.model and isinstance(self.model, str) and len(self.model) > 0:
            self.model = self.model.replace("-", " ").replace("/", " ").split()[0].upper().strip()

    def name_normalize(self):
        """ Remove duplicates from model and trade name:
            ("TESLA", "TESLA MODEL 3") => ("TESLA", "MODEL 3") """
        if isinstance(self.trade_name, str) and len(self.trade_name) > 0:
            name = self.trade_name.upper().strip()
            if name.split()[0] == self.model:
                # "TESLA MODEL 3" => [TESLA, MODEL, 3] => "MODEL 3"
                self.trade_name = ' '.join(name.split()[1:])
            else:
                self.trade_name = name

    def check_electric(self):
        self.is_electric = check_is_electric(self.model, self.trade_name, self.num_cylinders)
As an example of this approach, let's display the most expensive cars of a specific model in the Netherlands. With standard SQL, we can make a request like this:
with Session(rdw_db) as session:
    model = "BMW"
    limit = 5
    df = pd.read_sql_query(text(f'SELECT "Model", "Trade name", "Catalog price", "First registration NL" FROM rdw_cars WHERE Model = "{model}" ORDER BY "Catalog price" DESC LIMIT {limit}'), con=session.connection())
    display(df.style.hide(axis="index"))
As a result, we get this table:
It works, but the SQL string is pretty long, and I had to use an f-string to add the needed variables to the request. With the help of Object Relational Mapping, I can just use normal Python code:
with Session(rdw_db) as session:
    model = "BMW"
    limit = 5
    df = pd.read_sql_query(select(Car.model, Car.trade_name, Car.price, Car.first_registration).filter(Car.model == model).order_by(Car.price.desc()).limit(limit),
                           con=session.connection())
    display(df.style.hide(axis="index"))
SQLAlchemy will create a proper SQL request "under the hood", and as a result, we get easier-to-read Python code. The main drawback, though, is that SQL is a widespread standard; there are many resources and tutorials about it, while SQLAlchemy code is specific to this one library. But for our task, this approach works well.
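A handy way to see what SQLAlchemy generates "under the hood" is to compile the statement yourself. The snippet below is a minimal sketch using a standalone Core table definition (a stand-in for the Car class above, so the example is self-contained):

```python
from sqlalchemy import MetaData, Table, Column, Integer, String, select

# Minimal stand-in for the article's rdw_cars table
metadata = MetaData()
cars = Table(
    "rdw_cars", metadata,
    Column("index", Integer, primary_key=True),
    Column("Model", String),
    Column("Catalog price", Integer),
)

stmt = (select(cars.c["Model"], cars.c["Catalog price"])
        .where(cars.c["Model"] == "BMW")
        .order_by(cars.c["Catalog price"].desc())
        .limit(5))

# literal_binds inlines the parameters so the full SQL text is visible
print(stmt.compile(compile_kwargs={"literal_binds": True}))
```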
Using the ORM, we can easily apply our "check_electric" method to all the records in the database:
with Session(rdw_db) as session:
    cars_total = session.query(Car).count()

    index = 0
    batch_size = 25000
    while True:
        pos1, pos2 = index*batch_size, (index + 1)*batch_size
        if index % 20 == 0:
            print(f"Processing {pos1} to {pos2}, {100*index*batch_size//cars_total}%...")

        cars = session.query(Car).filter(Car.index.between(pos1, pos2)).all()
        if len(cars) == 0:
            break
        for car in cars:
            car.model_normalize()
            car.name_normalize()
            car.check_electric()

        session.flush()
        index += 1

    session.commit()
In this code, I read records from the database, update their parameters, and save the data back to the table. SQLAlchemy updates the table using SQL requests, and it is definitely slower than updating a Pandas dataframe directly in memory. Calling the same methods in Pandas took only 130 seconds, while the SQLAlchemy requests took 390 seconds, so the difference is about 3x. On the other hand, the batch update needs much less memory, and there is no need to keep the whole dataframe in RAM.
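When a transform can be expressed in SQL itself, a single set-based UPDATE avoids the per-row round trips entirely and is usually much faster. Only part of our logic qualifies — the model-name lookups still need Python — but the cylinder check does. A sketch of the idea, using the stdlib sqlite3 module and a tiny in-memory stand-in for the rdw_cars table:

```python
import sqlite3

# In-memory stand-in for rdw_cars with a few hypothetical rows
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE rdw_cars("Model" TEXT, "Number of Cylinders" INTEGER, "Is electric" INTEGER)')
con.executemany('INSERT INTO rdw_cars VALUES (?, ?, ?)',
                [("TESLA", None, 1), ("BMW", 4, 1), ("KIA", None, 1)])

# One statement clears the electric flag for every car with cylinders,
# instead of looping over ORM objects row by row
con.execute('UPDATE rdw_cars SET "Is electric" = 0 WHERE "Number of Cylinders" > 0')

flags = [row[0] for row in con.execute('SELECT "Is electric" FROM rdw_cars ORDER BY "Model"')]
print(flags)  # [0, 1, 1] — BMW cleared, KIA and TESLA untouched
```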
Analysis
After updating the table, we are finally ready to go. As a warm-up, let's calculate the mean and percentiles of the car prices.
Calculating the mean is easy and can be done with SQLAlchemy in one line of code. Let's get the total number of cars and the arithmetic mean of their prices:
with Session(rdw_db) as session:
    c_total = session.query(Car).count()
    print(f"Cars total: {c_total}")
    c_el = session.query(Car).filter(Car.is_electric == 1).count()
    print(f"Cars electric: {c_el} ({100*c_el/c_total:.2f}%)")

    pm = session.query(func.avg(Car.price)).scalar()
    print("Price mean:", pm)

    pm_el = session.query(func.avg(Car.price)).filter(Car.is_electric == 1).scalar()
    print("Electric cars price mean:", pm_el)
Getting the percentiles is a bit more challenging, and we have two ways of doing it. We can load only the "price" column; then it is possible to use the NumPy "percentile" method to do the math:
prices = session.query(Car.price).filter(Car.price != None).all()
print("All cars percentiles [5, 50, 95]:", np.percentile(prices, [5, 50, 95]))

prices_el = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).all()
print("Electric cars percentiles [5, 50, 95]:", np.percentile(prices_el, [5, 50, 95]))
If the dataset is large and we want to avoid loading the data at all, we can get the percentile using pure SQL by combining "order_by", "limit", and "offset":
num_total = session.query(Car).filter(Car.price != None).count()
p5 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*5/100 - 1).limit(1).scalar()
p50 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*50/100 - 1).limit(1).scalar()
p95 = session.query(Car.price).filter(Car.price != None).order_by(Car.price).offset(num_total*95/100 - 1).limit(1).scalar()
print("All cars percentiles [5, 50, 95]:", p5, p50, p95)

num_el = session.query(Car).filter((Car.price != None) & (Car.is_electric == 1)).count()
p5 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*5/100 - 1).limit(1).scalar()
p50 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*50/100 - 1).limit(1).scalar()
p95 = session.query(Car.price).filter((Car.price != None) & (Car.is_electric == 1)).order_by(Car.price).offset(num_el*95/100 - 1).limit(1).scalar()
print("Electric cars percentiles [5, 50, 95]:", p5, p50, p95)
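The offset trick is easy to verify on a toy table: for n ordered rows, the p-th percentile is approximated by the value at row n*p/100 - 1 of the sorted column (no interpolation, so results can differ slightly from np.percentile). A sketch with the stdlib sqlite3 module and made-up data:

```python
import sqlite3

# Toy table with values 1..100, so the p-th percentile is simply p
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices(v INTEGER)")
con.executemany("INSERT INTO prices VALUES (?)", [(v,) for v in range(1, 101)])

def percentile(p):
    n = con.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
    offset = max(n * p // 100 - 1, 0)
    # ORDER BY + LIMIT 1 + OFFSET picks a single row from the sorted column
    return con.execute(
        "SELECT v FROM prices ORDER BY v LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()[0]

print(percentile(5), percentile(50), percentile(95))  # 5 50 95
```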
The results are interesting:
The Netherlands is a country with pretty high average salaries, but at the moment of writing this article, only 2.93% of cars are electric. The median price of all cars is €26,341, and electric cars are "on average" 2x more expensive; their median price is €49,975. The 95th percentile of all cars is €73,381, which means that 95% of the cars have a lower price. At the same time, 95% of electric cars have a price lower than €106,989.
Let's now find something more fun and get the Top-20 electric car brands in the Netherlands:
with Session(rdw_db) as session:
    n_top = 20

    # Group by car model
    models_amount = session.query(Car.model, func.count(Car.model)).filter(Car.is_electric == 1).group_by(Car.model).order_by(desc(func.count(Car.model))).limit(n_top).all()[::-1]

    # Unzip array [('TESLA', 65896), ('VOLKSWAGEN', 28559)] into 2 parts
    models, amount = zip(*models_amount)

    # Show
    p = figure(y_range=models, width=1200, height=500,
               title="Top-%d electric car manufacturers in the Netherlands (data 2023)" % n_top)
    p.hbar(right=amount, y=models, height=0.8, color=Viridis256[:n_top])
    p.xgrid.grid_line_color = None
    p.x_range.start = 0
    p.below[0].formatter.use_scientific = False
    p.xaxis.axis_label = "Cars total"
    show(p)
As we can see, Tesla takes first place, with more than 55,000 cars registered in the country:
I was curious which Tesla model is the most popular. To find out, we can change the request:
models_amount = session.query(Car.trade_name, func.count(Car.trade_name)).filter(Car.model == "TESLA").group_by(Car.trade_name).order_by(desc(func.count(Car.trade_name))).order_by(Car.trade_name).all()[::-1]
...
It is clear that the "Model 3" is the most popular electric car at the moment of writing this article:
But we can also see that the dataset clearly needs more cleaning: some Tesla cars were registered as "MODEL 3", some as "MODEL3", some cars were saved as a "ROADSTER", and some as "RAODSTER", and so on.
Let's now group the electric car registrations by date. To make the graph clearer, I want to group dates by quarter, but the code for extracting quarters in SQL can be cumbersome. Instead, I will group registrations per day using SQL; then I can calculate quarters using the built-in Pandas function:
with Session(rdw_db) as session:
    regs_amount = session.query(Car.first_registration, func.count(Car.first_registration)).filter(Car.is_electric == 1).group_by(Car.first_registration).order_by(Car.first_registration).all()

    df = pd.DataFrame(regs_amount, columns=['First registration NL', 'Amount'])
    df["First registration NL"] = df['First registration NL'].map(lambda d: datetime.datetime(d.year, d.month, d.day))
    df["Quarter"] = df['First registration NL'].dt.to_period('Q')

    data_per_quarter = df.groupby(['Quarter'], as_index=False)["Amount"].sum()
    dates = data_per_quarter['Quarter']
    amount = data_per_quarter['Amount']

    p = figure(x_axis_type='datetime', width=1600, height=500,
               title="Electric car registrations in the Netherlands, 1992-2022")
    p.vbar(x=dates, top=amount, width=datetime.timedelta(days=3*22), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 20
    p.yaxis.axis_label = "Cars total"
    show(p)
In this code, I first converted the SQL result to a Pandas dataframe; then I converted the Python "date" objects to "datetime" (for some reason, the quarter calculation works with "datetime" only). The code is quite similar to the first part, but here I use Pandas "groupby.sum()" instead of "size()", because the data retrieved from SQL is already grouped by days.
The result is interesting:
As described in the first part, the first electric car was registered in the Netherlands in 1992. It was a Fiat Panda Elettra, a small two-seat car with a 70 km/h maximum speed, a 100 km range, and 12 lead-acid 6V batteries as a power source. It was the only electric car in the whole country for 15 years; the next three Tesla Roadster cars were registered only in 2009.
Let's now get the price distribution of electric cars. I want to draw a box-and-whisker plot, and for that, I need to know the minimum, maximum, and quartile values per model:
with Session(rdw_db) as session:
    request_models = session.query(Car.model).filter(Car.is_electric == 1).group_by(Car.model).all()

    def q0(x):
        return x.quantile(0.01)

    def q1(x):
        return x.quantile(0.25)

    def q3(x):
        return x.quantile(0.75)

    def q4(x):
        return x.quantile(0.99)

    models_data = {}
    for m in request_models:
        model_name = m[0]  # (AIWAYS,) => AIWAYS
        print("Processing", model_name)
        request_model = session.query(Car.price).filter((Car.is_electric == 1) & (Car.price > 0) & (Car.model == model_name)).all()
        df = pd.DataFrame(request_model)
        agg_data = {'price': ['size', 'min', q0, q1, 'median', q3, q4, 'max']}
        models_data[model_name] = df.agg(agg_data)["price"]

    df = pd.concat(models_data, axis=1).transpose()
    display(df)
In this code, I first get the list of all car models; then I get the prices for each model and aggregate them using Pandas. Finally, I combine the data into a single dataframe. The result looks like this:
Having this dataframe, it is easy to draw a box plot:
# Sort models by price
df = df.sort_values(by='median', ascending=True)

models = df.index.values
v_min = df["q0"].values
q1 = df["q1"].values
q3 = df["q3"].values
v_max = df["q4"].values

# Draw
palette = (Inferno10 + Magma10 + Plasma10 + Viridis10)[:models.shape[0]]
source = ColumnDataSource(data=dict(models=models, bottom=q1, top=q3, color=palette,
                                    lower=v_min, upper=v_max))

p = figure(x_range=models, width=1900, height=500,
           title="Electric car prices distribution in the Netherlands")
whisker = Whisker(base="models", upper="upper", lower="lower", source=source)
p.add_layout(whisker)
p.vbar(x='models', top='top', bottom='bottom', width=0.9, color='color',
       line_color="black", source=source)
p.left[0].formatter.use_scientific = False
p.y_range.start = 0
show(p)
The result looks like this:
With the help of SQLAlchemy, it is also easy to get all the prices of electric cars and build a histogram using the "np.histogram" method. The code is almost the same as in the first part; those who wish can do it on their own.
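For completeness, a sketch of that histogram step, with made-up prices standing in for the query result (the Bokeh part is left as a comment, since it mirrors the earlier plotting code):

```python
import numpy as np

# Hypothetical stand-in for session.query(Car.price).filter(...).all():
# any 1-D array of catalog prices
prices = np.array([18000, 23000, 45000, 49000, 52000, 61000, 75000, 110000])

# np.histogram returns the counts and the bin edges (one more edge than bins)
hist, edges = np.histogram(prices, bins=4)
print(hist, edges)

# Feeding this to Bokeh is then one call:
#   p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])
```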
Conclusion
Analyzing data from a real dataset is interesting, and it turned out that SQL and Pandas work together quite well. The "heavy lifting" of retrieving and pre-processing the data can be done using SQLAlchemy; then this data can be used in Pandas.
As for the processing of the data itself, a lot of work can still be done. It would be interesting to combine this data with the Kaggle Electric Vehicles dataset and find the correlation between the maximum distance range, price, and car release date (newer models should have a longer range). I tried to do this, but the car model names in the two datasets do not match, and I am not enough of an expert in cars to match them manually for every model. Also, as was shown before, the RDW dataset needs more cleaning, and not all the names are consistent. Interested readers can continue these experiments on their own.
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.
Thanks for reading.