Notebook 1: Exploratory Analysis
Importing modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path
pd.set_option("display.max_columns",200)
pd.set_option("display.max_rows",200)
Exploratory Analysis: Structure
# Load the CSV file
DATA_PATH = Path("..")/"data"/"raw"/"seattle_2016_buildings.csv"
building_consumption = pd.read_csv(DATA_PATH)
building_consumption.shape
(3376, 46)
The dataset contains 3,376 buildings described by 46 variables.
# Look at how a building is described in this dataset
building_consumption.head()
| | OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | SecondLargestPropertyUseType | SecondLargestPropertyUseTypeGFA | ThirdLargestPropertyUseType | ThirdLargestPropertyUseTypeGFA | YearsENERGYSTARCertified | ENERGYSTARScore | SiteEUI(kBtu/sf) | SiteEUIWN(kBtu/sf) | SourceEUI(kBtu/sf) | SourceEUIWN(kBtu/sf) | SiteEnergyUse(kBtu) | SiteEnergyUseWN(kBtu) | SteamUse(kBtu) | Electricity(kWh) | Electricity(kBtu) | NaturalGas(therms) | NaturalGas(kBtu) | DefaultData | Comments | ComplianceStatus | Outlier | TotalGHGEmissions | GHGEmissionsIntensity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | NonResidential | Hotel | Mayflower park hotel | 405 Olive way | Seattle | WA | 98101.0 | 0659000030 | 7 | DOWNTOWN | 47.61220 | -122.33799 | 1927 | 1.0 | 12 | 88434 | 0 | 88434 | Hotel | Hotel | 88434.0 | NaN | NaN | NaN | NaN | NaN | 60.0 | 81.699997 | 84.300003 | 182.500000 | 189.000000 | 7226362.5 | 7456910.0 | 2003882.00 | 1.156514e+06 | 3946027.0 | 12764.52930 | 1276453.0 | False | NaN | Compliant | NaN | 249.98 | 2.83 |
| 1 | 2 | 2016 | NonResidential | Hotel | Paramount Hotel | 724 Pine street | Seattle | WA | 98101.0 | 0659000220 | 7 | DOWNTOWN | 47.61317 | -122.33393 | 1996 | 1.0 | 11 | 103566 | 15064 | 88502 | Hotel, Parking, Restaurant | Hotel | 83880.0 | Parking | 15064.0 | Restaurant | 4622.0 | NaN | 61.0 | 94.800003 | 97.900002 | 176.100006 | 179.399994 | 8387933.0 | 8664479.0 | 0.00 | 9.504252e+05 | 3242851.0 | 51450.81641 | 5145082.0 | False | NaN | Compliant | NaN | 295.86 | 2.86 |
| 2 | 3 | 2016 | NonResidential | Hotel | 5673-The Westin Seattle | 1900 5th Avenue | Seattle | WA | 98101.0 | 0659000475 | 7 | DOWNTOWN | 47.61393 | -122.33810 | 1969 | 1.0 | 41 | 956110 | 196718 | 759392 | Hotel | Hotel | 756493.0 | NaN | NaN | NaN | NaN | NaN | 43.0 | 96.000000 | 97.699997 | 241.899994 | 244.100006 | 72587024.0 | 73937112.0 | 21566554.00 | 1.451544e+07 | 49526664.0 | 14938.00000 | 1493800.0 | False | NaN | Compliant | NaN | 2089.28 | 2.19 |
| 3 | 5 | 2016 | NonResidential | Hotel | HOTEL MAX | 620 STEWART ST | Seattle | WA | 98101.0 | 0659000640 | 7 | DOWNTOWN | 47.61412 | -122.33664 | 1926 | 1.0 | 10 | 61320 | 0 | 61320 | Hotel | Hotel | 61320.0 | NaN | NaN | NaN | NaN | NaN | 56.0 | 110.800003 | 113.300003 | 216.199997 | 224.000000 | 6794584.0 | 6946800.5 | 2214446.25 | 8.115253e+05 | 2768924.0 | 18112.13086 | 1811213.0 | False | NaN | Compliant | NaN | 286.43 | 4.67 |
| 4 | 8 | 2016 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 401 LENORA ST | Seattle | WA | 98121.0 | 0659000970 | 7 | DOWNTOWN | 47.61375 | -122.34047 | 1980 | 1.0 | 18 | 175580 | 62000 | 113580 | Hotel, Parking, Swimming Pool | Hotel | 123445.0 | Parking | 68009.0 | Swimming Pool | 0.0 | NaN | 75.0 | 114.800003 | 118.699997 | 211.399994 | 215.600006 | 14172606.0 | 14656503.0 | 0.00 | 1.573449e+06 | 5368607.0 | 88039.98438 | 8803998.0 | False | NaN | Compliant | NaN | 505.01 | 2.88 |
A building is described by:
- identification/context columns
- location columns
- columns on the building's structure (the most useful for prediction)
- columns on how floor area is used
- energy and emissions columns (the potential targets: SiteEnergyUse with or without WN, TotalGHGEmissions or its intensity per unit of floor area; the remaining columns in this group must be excluded from modeling to avoid data leakage)
- quality/compliance columns (useful for cleaning)
Detailed variable definitions are available in the City of Seattle's official documentation:
https://data.seattle.gov/Built-Environment/Building-Energy-Benchmarking-Data-2015-Present/teqw-tu6e/about_data
Good to know: the City of Seattle publishes multi-year data (2015-2024).
This project deliberately focuses on a static analysis of buildings using the 2016 data, in order to model the impact of structural characteristics on energy consumption and CO2 emissions. A multi-year analysis could be an interesting extension for studying changes over time or the impact of public policy, but it would address a different question and is not the core of this work.
# Number of missing values per column
building_consumption.isna().sum().sort_values(ascending=False)
Comments    3376
Outlier    3344
YearsENERGYSTARCertified    3257
ThirdLargestPropertyUseType    2780
ThirdLargestPropertyUseTypeGFA    2780
SecondLargestPropertyUseType    1697
SecondLargestPropertyUseTypeGFA    1697
ENERGYSTARScore    843
LargestPropertyUseTypeGFA    20
LargestPropertyUseType    20
ZipCode    16
ListOfAllPropertyUseTypes    9
Electricity(kWh)    9
SourceEUIWN(kBtu/sf)    9
GHGEmissionsIntensity    9
TotalGHGEmissions    9
NaturalGas(therms)    9
SteamUse(kBtu)    9
NaturalGas(kBtu)    9
SourceEUI(kBtu/sf)    9
Electricity(kBtu)    9
NumberofBuildings    8
SiteEUI(kBtu/sf)    7
SiteEnergyUseWN(kBtu)    6
SiteEUIWN(kBtu/sf)    6
SiteEnergyUse(kBtu)    5
OSEBuildingID    0
PropertyName    0
PrimaryPropertyType    0
BuildingType    0
Longitude    0
Latitude    0
Neighborhood    0
CouncilDistrictCode    0
TaxParcelIdentificationNumber    0
State    0
City    0
Address    0
DataYear    0
PropertyGFATotal    0
PropertyGFABuilding(s)    0
YearBuilt    0
PropertyGFAParking    0
NumberofFloors    0
DefaultData    0
ComplianceStatus    0
dtype: int64
# Compute the percentage of NaN per column
missing_building_consumption = (building_consumption.isna().mean()*100).sort_values(ascending=False)
missing_building_consumption = missing_building_consumption.to_frame("missing_%")
missing_building_consumption[missing_building_consumption["missing_%"]>0]
| | missing_% |
|---|---|
| Comments | 100.000000 |
| Outlier | 99.052133 |
| YearsENERGYSTARCertified | 96.475118 |
| ThirdLargestPropertyUseType | 82.345972 |
| ThirdLargestPropertyUseTypeGFA | 82.345972 |
| SecondLargestPropertyUseType | 50.266588 |
| SecondLargestPropertyUseTypeGFA | 50.266588 |
| ENERGYSTARScore | 24.970379 |
| LargestPropertyUseTypeGFA | 0.592417 |
| LargestPropertyUseType | 0.592417 |
| ZipCode | 0.473934 |
| ListOfAllPropertyUseTypes | 0.266588 |
| Electricity(kWh) | 0.266588 |
| SourceEUIWN(kBtu/sf) | 0.266588 |
| GHGEmissionsIntensity | 0.266588 |
| TotalGHGEmissions | 0.266588 |
| NaturalGas(therms) | 0.266588 |
| SteamUse(kBtu) | 0.266588 |
| NaturalGas(kBtu) | 0.266588 |
| SourceEUI(kBtu/sf) | 0.266588 |
| Electricity(kBtu) | 0.266588 |
| NumberofBuildings | 0.236967 |
| SiteEUI(kBtu/sf) | 0.207346 |
| SiteEnergyUseWN(kBtu) | 0.177725 |
| SiteEUIWN(kBtu/sf) | 0.177725 |
| SiteEnergyUse(kBtu) | 0.148104 |
The missing-value analysis highlights several categories of variables.
1) The columns Comments, Outlier and YearsENERGYSTARCertified have missing-value rates close to 100% and carry no usable information for prediction: they will be dropped.
2) The variables describing the buildings' secondary uses have high missing-value rates. These gaps are structural, however: not every building has several uses.
To keep this information without needlessly complicating the model, a binary variable indicating the presence of a second use may be preferred (derived from ListOfAllPropertyUseTypes or SecondLargestPropertyUseType). Information about a third use, by contrast, is too rare in the dataset and will not be kept.
3) The variable ENERGYSTARScore is partially missing (24.97%) but remains of strong domain interest and will be examined in more detail later.
4) Finally, most of the other variables have very low missing-value rates, which do not justify dropping them at this stage.
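The categories above can be sketched as a small helper that groups columns by their missing-value rate. This is only an illustration on a toy frame: the function name `bucket_missing_rates` and the 5% "review" threshold are assumptions, while the 90% drop threshold mirrors the one applied later in the cleaning step.

```python
import numpy as np
import pandas as pd


def bucket_missing_rates(df, drop_above=90.0, review_above=5.0):
    """Group column names by their percentage of missing values.

    review_above is an illustrative assumption, not a value from the notebook.
    """
    rates = df.isna().mean() * 100
    return {
        "drop": sorted(rates[rates > drop_above].index),
        "review": sorted(rates[(rates > review_above) & (rates <= drop_above)].index),
        "keep": sorted(rates[rates <= review_above].index),
    }


# Toy frame standing in for the real dataset (10 rows).
toy = pd.DataFrame({
    "Comments": [np.nan] * 10,
    "ENERGYSTARScore": [60.0, np.nan, 43.0, np.nan, 75.0, 61.0, 56.0, np.nan, 80.0, 90.0],
    "YearBuilt": [1927, 1996, 1969, 1926, 1980, 1990, 2004, 1974, 1989, 1938],
})

buckets = bucket_missing_rates(toy)
print(buckets)
```

A purely numeric rule would place the second-use and third-use columns in the "review" bucket, so the decision to replace them with a binary flag remains a judgment call rather than something the thresholds decide.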
# Variable types
building_consumption.dtypes.to_frame("dtypes")
| | dtypes |
|---|---|
| OSEBuildingID | int64 |
| DataYear | int64 |
| BuildingType | str |
| PrimaryPropertyType | str |
| PropertyName | str |
| Address | str |
| City | str |
| State | str |
| ZipCode | float64 |
| TaxParcelIdentificationNumber | str |
| CouncilDistrictCode | int64 |
| Neighborhood | str |
| Latitude | float64 |
| Longitude | float64 |
| YearBuilt | int64 |
| NumberofBuildings | float64 |
| NumberofFloors | int64 |
| PropertyGFATotal | int64 |
| PropertyGFAParking | int64 |
| PropertyGFABuilding(s) | int64 |
| ListOfAllPropertyUseTypes | str |
| LargestPropertyUseType | str |
| LargestPropertyUseTypeGFA | float64 |
| SecondLargestPropertyUseType | str |
| SecondLargestPropertyUseTypeGFA | float64 |
| ThirdLargestPropertyUseType | str |
| ThirdLargestPropertyUseTypeGFA | float64 |
| YearsENERGYSTARCertified | object |
| ENERGYSTARScore | float64 |
| SiteEUI(kBtu/sf) | float64 |
| SiteEUIWN(kBtu/sf) | float64 |
| SourceEUI(kBtu/sf) | float64 |
| SourceEUIWN(kBtu/sf) | float64 |
| SiteEnergyUse(kBtu) | float64 |
| SiteEnergyUseWN(kBtu) | float64 |
| SteamUse(kBtu) | float64 |
| Electricity(kWh) | float64 |
| Electricity(kBtu) | float64 |
| NaturalGas(therms) | float64 |
| NaturalGas(kBtu) | float64 |
| DefaultData | bool |
| Comments | float64 |
| ComplianceStatus | str |
| Outlier | str |
| TotalGHGEmissions | float64 |
| GHGEmissionsIntensity | float64 |
The analysis of variable types reveals no major inconsistencies. The types are broadly consistent with the nature of the data.
Good to know: the exploratory analysis draws on all available data types in order to understand the context and quality of the dataset. The modeling phase, by contrast, focuses only on relevant quantitative and categorical variables, transformed appropriately for the learning model used.
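One possible way to separate quantitative from categorical variables ahead of modeling is `DataFrame.select_dtypes`. The sketch below runs on a toy frame whose values are illustrative; which columns are actually relevant for the model is a separate decision that this snippet does not make.

```python
import pandas as pd

# Toy frame with one text column, two numeric columns and one boolean flag.
toy = pd.DataFrame({
    "PrimaryPropertyType": ["Hotel", "Office", "Hotel"],
    "NumberofFloors": [12, 11, 41],
    "PropertyGFATotal": [88434, 103566, 956110],
    "DefaultData": [False, False, True],
})

# "number" covers integer and float columns; booleans are not included.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is neither numeric nor boolean is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Numeric codes such as ZipCode or CouncilDistrictCode would land in the numeric list with this approach even though they behave like categories, so the result should be reviewed by hand.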
building_consumption.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OSEBuildingID | 3376.0 | NaN | NaN | NaN | 21208.991114 | 12223.757015 | 1.0 | 19990.75 | 23112.0 | 25994.25 | 50226.0 |
| DataYear | 3376.0 | NaN | NaN | NaN | 2016.0 | 0.0 | 2016.0 | 2016.0 | 2016.0 | 2016.0 | 2016.0 |
| BuildingType | 3376 | 8 | NonResidential | 1460 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PrimaryPropertyType | 3376 | 24 | Low-Rise Multifamily | 987 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PropertyName | 3376 | 3362 | Northgate Plaza | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Address | 3376 | 3354 | 2600 SW Barton St | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| City | 3376 | 1 | Seattle | 3376 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| State | 3376 | 1 | WA | 3376 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ZipCode | 3360.0 | NaN | NaN | NaN | 98116.949107 | 18.615205 | 98006.0 | 98105.0 | 98115.0 | 98122.0 | 98272.0 |
| TaxParcelIdentificationNumber | 3376 | 3268 | 1625049001 | 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CouncilDistrictCode | 3376.0 | NaN | NaN | NaN | 4.439277 | 2.120625 | 1.0 | 3.0 | 4.0 | 7.0 | 7.0 |
| Neighborhood | 3376 | 19 | DOWNTOWN | 573 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Latitude | 3376.0 | NaN | NaN | NaN | 47.624033 | 0.047758 | 47.49917 | 47.59986 | 47.618675 | 47.657115 | 47.73387 |
| Longitude | 3376.0 | NaN | NaN | NaN | -122.334795 | 0.027203 | -122.41425 | -122.350662 | -122.332495 | -122.319407 | -122.220966 |
| YearBuilt | 3376.0 | NaN | NaN | NaN | 1968.573164 | 33.088156 | 1900.0 | 1948.0 | 1975.0 | 1997.0 | 2015.0 |
| NumberofBuildings | 3368.0 | NaN | NaN | NaN | 1.106888 | 2.108402 | 0.0 | 1.0 | 1.0 | 1.0 | 111.0 |
| NumberofFloors | 3376.0 | NaN | NaN | NaN | 4.709123 | 5.494465 | 0.0 | 2.0 | 4.0 | 5.0 | 99.0 |
| PropertyGFATotal | 3376.0 | NaN | NaN | NaN | 94833.537322 | 218837.60712 | 11285.0 | 28487.0 | 44175.0 | 90992.0 | 9320156.0 |
| PropertyGFAParking | 3376.0 | NaN | NaN | NaN | 8001.526066 | 32326.723928 | 0.0 | 0.0 | 0.0 | 0.0 | 512608.0 |
| PropertyGFABuilding(s) | 3376.0 | NaN | NaN | NaN | 86832.011256 | 207939.811923 | 3636.0 | 27756.0 | 43216.0 | 84276.25 | 9320156.0 |
| ListOfAllPropertyUseTypes | 3367 | 466 | Multifamily Housing | 866 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LargestPropertyUseType | 3356 | 56 | Multifamily Housing | 1667 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LargestPropertyUseTypeGFA | 3356.0 | NaN | NaN | NaN | 79177.638558 | 201703.407492 | 5656.0 | 25094.75 | 39894.0 | 76200.25 | 9320156.0 |
| SecondLargestPropertyUseType | 1679 | 50 | Parking | 976 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| SecondLargestPropertyUseTypeGFA | 1679.0 | NaN | NaN | NaN | 28444.075817 | 54392.917928 | 0.0 | 5000.0 | 10664.0 | 26640.0 | 686750.0 |
| ThirdLargestPropertyUseType | 596 | 44 | Retail Store | 110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ThirdLargestPropertyUseTypeGFA | 596.0 | NaN | NaN | NaN | 11738.675166 | 29331.199286 | 0.0 | 2239.0 | 5043.0 | 10138.75 | 459748.0 |
| YearsENERGYSTARCertified | 119.0 | 65.0 | 2016.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ENERGYSTARScore | 2533.0 | NaN | NaN | NaN | 67.918674 | 26.873271 | 1.0 | 53.0 | 75.0 | 90.0 | 100.0 |
| SiteEUI(kBtu/sf) | 3369.0 | NaN | NaN | NaN | 54.732116 | 56.273124 | 0.0 | 27.9 | 38.599998 | 60.400002 | 834.400024 |
| SiteEUIWN(kBtu/sf) | 3370.0 | NaN | NaN | NaN | 57.033798 | 57.16333 | 0.0 | 29.4 | 40.900002 | 64.275002 | 834.400024 |
| SourceEUI(kBtu/sf) | 3367.0 | NaN | NaN | NaN | 134.232848 | 139.287554 | 0.0 | 74.699997 | 96.199997 | 143.899994 | 2620.0 |
| SourceEUIWN(kBtu/sf) | 3367.0 | NaN | NaN | NaN | 137.783932 | 139.109807 | -2.1 | 78.400002 | 101.099998 | 148.349998 | 2620.0 |
| SiteEnergyUse(kBtu) | 3371.0 | NaN | NaN | NaN | 5403667.294533 | 21610628.627639 | 0.0 | 925128.59375 | 1803753.25 | 4222455.25 | 873923712.0 |
| SiteEnergyUseWN(kBtu) | 3370.0 | NaN | NaN | NaN | 5276725.714395 | 15938786.484121 | 0.0 | 970182.234375 | 1904452.0 | 4381429.125 | 471613856.0 |
| SteamUse(kBtu) | 3367.0 | NaN | NaN | NaN | 274595.898209 | 3912173.392696 | 0.0 | 0.0 | 0.0 | 0.0 | 134943456.0 |
| Electricity(kWh) | 3367.0 | NaN | NaN | NaN | 1086638.966571 | 4352478.355209 | -33826.80078 | 187422.94535 | 345129.9063 | 829317.84375 | 192577488.0 |
| Electricity(kBtu) | 3367.0 | NaN | NaN | NaN | 3707612.161594 | 14850656.138963 | -115417.0 | 639487.0 | 1177583.0 | 2829632.5 | 657074389.0 |
| NaturalGas(therms) | 3367.0 | NaN | NaN | NaN | 13685.045376 | 67097.808296 | 0.0 | 0.0 | 3237.537598 | 11890.33496 | 2979090.0 |
| NaturalGas(kBtu) | 3367.0 | NaN | NaN | NaN | 1368504.541443 | 6709780.83488 | 0.0 | 0.0 | 323754.0 | 1189033.5 | 297909000.0 |
| DefaultData | 3376 | 2 | False | 3263 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Comments | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ComplianceStatus | 3376 | 4 | Compliant | 3211 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Outlier | 32 | 2 | Low outlier | 23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| TotalGHGEmissions | 3367.0 | NaN | NaN | NaN | 119.723971 | 538.832227 | -0.8 | 9.495 | 33.92 | 93.94 | 16870.98 |
| GHGEmissionsIntensity | 3367.0 | NaN | NaN | NaN | 1.175916 | 1.821452 | -0.02 | 0.21 | 0.61 | 1.37 | 34.09 |
The purpose of describe here is to check the overall consistency and diversity of the data.
- Temporal consistency: 2016 only
- Geographic consistency: all buildings are located in Seattle
- The variables describing the buildings show strong diversity, both in uses (single-use and multi-use buildings) and in structure (year built, number of buildings, number of floors, floor areas, etc.)
- The variables related to energy consumption and CO2 emissions will be analyzed in more detail later
- Missing values and atypical values will be handled during the data-cleaning step
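The describe output shows a few minimums that look atypical, such as a negative TotalGHGEmissions or a NumberofFloors of 0. A minimal sketch for counting such rows before cleaning could look like the following; the helper name `count_suspicious` and the choice of which columns must be non-negative or strictly positive are assumptions made for illustration.

```python
import pandas as pd


def count_suspicious(df, non_negative_cols, strictly_positive_cols):
    """Count rows that break simple sign expectations, column by column."""
    report = {}
    for col in non_negative_cols:
        # NaN compares as False, so missing values are not counted here.
        report[col] = int((df[col] < 0).sum())
    for col in strictly_positive_cols:
        report[col] = int((df[col] <= 0).sum())
    return report


# Toy values, including one negative emission and one building with 0 floors.
toy = pd.DataFrame({
    "TotalGHGEmissions": [249.98, -0.8, 33.92, float("nan")],
    "NumberofFloors": [12, 0, 4, 5],
})

report = count_suspicious(
    toy,
    non_negative_cols=["TotalGHGEmissions"],
    strictly_positive_cols=["NumberofFloors"],
)
print(report)
```

Counting first, rather than filtering immediately, keeps the decision about how to treat these rows for the cleaning step.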
# Check for duplicates
building_consumption["OSEBuildingID"].duplicated().sum()
np.int64(0)
The uniqueness of observations was checked using the OSEBuildingID identifier.
No duplicate identifier was found, which indicates that each row corresponds to a distinct building.
Data preparation and cleaning
Targets
In this project, the goal is to predict buildings' total energy consumption, SiteEnergyUse(kBtu), and their CO2 emissions, TotalGHGEmissions, from the buildings' structural and usage characteristics.
The target variables are therefore:
- SiteEnergyUse(kBtu), the building's annual energy consumption
- TotalGHGEmissions, the building's total greenhouse gas emissions (CO2)
# Drop the other energy columns, keeping only the two targets
energy_keywords = [
    "Energy", "EUI", "WN", "Steam", "Electricity", "Gas", "Intensity"
]
cols_to_drop = [
    col for col in building_consumption.columns
    if any(keyword in col for keyword in energy_keywords)
    and col not in ["SiteEnergyUse(kBtu)", "TotalGHGEmissions"]
]
building_df_cleaning = building_consumption.drop(columns=cols_to_drop)
building_df_cleaning
| | OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | SecondLargestPropertyUseType | SecondLargestPropertyUseTypeGFA | ThirdLargestPropertyUseType | ThirdLargestPropertyUseTypeGFA | YearsENERGYSTARCertified | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | Comments | ComplianceStatus | Outlier | TotalGHGEmissions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | NonResidential | Hotel | Mayflower park hotel | 405 Olive way | Seattle | WA | 98101.0 | 0659000030 | 7 | DOWNTOWN | 47.61220 | -122.33799 | 1927 | 1.0 | 12 | 88434 | 0 | 88434 | Hotel | Hotel | 88434.0 | NaN | NaN | NaN | NaN | NaN | 60.0 | 7.226362e+06 | False | NaN | Compliant | NaN | 249.98 |
| 1 | 2 | 2016 | NonResidential | Hotel | Paramount Hotel | 724 Pine street | Seattle | WA | 98101.0 | 0659000220 | 7 | DOWNTOWN | 47.61317 | -122.33393 | 1996 | 1.0 | 11 | 103566 | 15064 | 88502 | Hotel, Parking, Restaurant | Hotel | 83880.0 | Parking | 15064.0 | Restaurant | 4622.0 | NaN | 61.0 | 8.387933e+06 | False | NaN | Compliant | NaN | 295.86 |
| 2 | 3 | 2016 | NonResidential | Hotel | 5673-The Westin Seattle | 1900 5th Avenue | Seattle | WA | 98101.0 | 0659000475 | 7 | DOWNTOWN | 47.61393 | -122.33810 | 1969 | 1.0 | 41 | 956110 | 196718 | 759392 | Hotel | Hotel | 756493.0 | NaN | NaN | NaN | NaN | NaN | 43.0 | 7.258702e+07 | False | NaN | Compliant | NaN | 2089.28 |
| 3 | 5 | 2016 | NonResidential | Hotel | HOTEL MAX | 620 STEWART ST | Seattle | WA | 98101.0 | 0659000640 | 7 | DOWNTOWN | 47.61412 | -122.33664 | 1926 | 1.0 | 10 | 61320 | 0 | 61320 | Hotel | Hotel | 61320.0 | NaN | NaN | NaN | NaN | NaN | 56.0 | 6.794584e+06 | False | NaN | Compliant | NaN | 286.43 |
| 4 | 8 | 2016 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 401 LENORA ST | Seattle | WA | 98121.0 | 0659000970 | 7 | DOWNTOWN | 47.61375 | -122.34047 | 1980 | 1.0 | 18 | 175580 | 62000 | 113580 | Hotel, Parking, Swimming Pool | Hotel | 123445.0 | Parking | 68009.0 | Swimming Pool | 0.0 | NaN | 75.0 | 1.417261e+07 | False | NaN | Compliant | NaN | 505.01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3371 | 50222 | 2016 | Nonresidential COS | Office | Horticulture building | 1600 S Dakota St | Seattle | WA | NaN | 1624049080 | 2 | GREATER DUWAMISH | 47.56722 | -122.31154 | 1990 | 1.0 | 1 | 12294 | 0 | 12294 | Office | Office | 12294.0 | NaN | NaN | NaN | NaN | NaN | 46.0 | 8.497457e+05 | True | NaN | Error - Correct Default Data | NaN | 20.94 |
| 3372 | 50223 | 2016 | Nonresidential COS | Other | International district/Chinatown CC | 719 8th Ave S | Seattle | WA | NaN | 3558300000 | 2 | DOWNTOWN | 47.59625 | -122.32283 | 2004 | 1.0 | 1 | 16000 | 0 | 16000 | Other - Recreation | Other - Recreation | 16000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 9.502762e+05 | False | NaN | Compliant | NaN | 32.17 |
| 3373 | 50224 | 2016 | Nonresidential COS | Other | Queen Anne Pool | 1920 1st Ave W | Seattle | WA | NaN | 1794501150 | 7 | MAGNOLIA / QUEEN ANNE | 47.63644 | -122.35784 | 1974 | 1.0 | 1 | 13157 | 0 | 13157 | Fitness Center/Health Club/Gym, Other - Recrea... | Other - Recreation | 7583.0 | Fitness Center/Health Club/Gym | 5574.0 | Swimming Pool | 0.0 | NaN | NaN | 5.765898e+06 | False | NaN | Compliant | NaN | 223.54 |
| 3374 | 50225 | 2016 | Nonresidential COS | Mixed Use Property | South Park Community Center | 8319 8th Ave S | Seattle | WA | NaN | 7883603155 | 1 | GREATER DUWAMISH | 47.52832 | -122.32431 | 1989 | 1.0 | 1 | 14101 | 0 | 14101 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 6601.0 | Fitness Center/Health Club/Gym | 6501.0 | Pre-school/Daycare | 484.0 | NaN | NaN | 7.194712e+05 | False | NaN | Compliant | NaN | 22.11 |
| 3375 | 50226 | 2016 | Nonresidential COS | Mixed Use Property | Van Asselt Community Center | 2820 S Myrtle St | Seattle | WA | NaN | 7857002030 | 2 | GREATER DUWAMISH | 47.53939 | -122.29536 | 1938 | 1.0 | 1 | 18258 | 0 | 18258 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8271.0 | Fitness Center/Health Club/Gym | 8000.0 | Pre-school/Daycare | 1108.0 | NaN | NaN | 1.152896e+06 | False | NaN | Compliant | NaN | 41.27 |
3376 rows × 35 columns
The variables directly tied to detailed energy consumption were dropped to avoid any information leakage.
The aim is to ensure that predictions rely only on the buildings' structural and usage characteristics, not on energy measurements that have already been observed.
The ENERGYSTARScore variable is kept at this stage for exploratory analysis but will not be used for modeling. Although it is not a direct component of energy consumption, it is still a consequence of the building's overall energy performance.
At this stage, the dataset contains 3,376 buildings described by 35 variables.
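To see why ENERGYSTARScore survives the keyword filter while the detailed energy columns do not, the matching can be replayed on a handful of column names. The sketch below uses only plain Python; the variable names `targets`, `dropped` and `kept` are introduced for illustration, and the column list is a small subset of the real one.

```python
# Same keywords as in the cleaning cell; substring matching is case-sensitive.
energy_keywords = ["Energy", "EUI", "WN", "Steam", "Electricity", "Gas", "Intensity"]
targets = ["SiteEnergyUse(kBtu)", "TotalGHGEmissions"]

columns = [
    "ENERGYSTARScore",
    "SiteEUI(kBtu/sf)",
    "SiteEnergyUse(kBtu)",
    "SiteEnergyUseWN(kBtu)",
    "NaturalGas(therms)",
    "TotalGHGEmissions",
    "GHGEmissionsIntensity",
    "YearBuilt",
]

# A column is dropped when it contains any keyword and is not a target.
dropped = [
    col for col in columns
    if any(keyword in col for keyword in energy_keywords) and col not in targets
]
kept = [col for col in columns if col not in dropped]

print("dropped:", dropped)
print("kept:", kept)
```

"ENERGYSTARScore" is kept because the uppercase "ENERGY" does not match the keyword "Energy"; a case-insensitive match would remove it as well, along with YearsENERGYSTARCertified.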
Defining the scope: "non-residential buildings"
building_df_cleaning["BuildingType"].value_counts()
BuildingType
NonResidential    1460
Multifamily LR (1-4)    1018
Multifamily MR (5-9)    580
Multifamily HR (10+)    110
SPS-District K-12    98
Nonresidential COS    85
Campus    24
Nonresidential WA    1
Name: count, dtype: int64
to_keep = [
    "NonResidential",
    "Nonresidential COS",
    "Nonresidential WA",
    "SPS-District K-12",
    "Campus"
]
# .copy() yields an independent frame, so later column assignments do not act on a slice
building_df_cleaning = building_df_cleaning[
    building_df_cleaning["BuildingType"].isin(to_keep)
].copy()
building_df_cleaning["BuildingType"].value_counts()
BuildingType
NonResidential    1460
SPS-District K-12    98
Nonresidential COS    85
Campus    24
Nonresidential WA    1
Name: count, dtype: int64
building_df_cleaning.shape
(1668, 35)
Residential buildings (multifamily housing) were excluded from the scope.
The buildings kept correspond exclusively to non-residential uses (offices, schools, campuses and institutional buildings).
At this stage, the dataset contains 1,668 buildings described by 35 variables.
Handling missing values (NaN)
# Columns to drop (> 90% missing values)
cols_to_drop = ["Comments", "Outlier", "YearsENERGYSTARCertified"]
building_df_cleaning = building_df_cleaning.drop(columns=cols_to_drop, errors="ignore")
building_df_cleaning.shape
(1668, 32)
Variables with more than 90% missing values were dropped.
# Columns to drop ("third use"): too many NaN, and the single-use vs multi-use flag is used instead
cols_to_drop = ["ThirdLargestPropertyUseType", "ThirdLargestPropertyUseTypeGFA"]
building_df_cleaning = building_df_cleaning.drop(columns=cols_to_drop, errors="ignore")
building_df_cleaning.shape
(1668, 30)
# Create the binary "multi-use" variable from SecondLargestPropertyUseType and SecondLargestPropertyUseTypeGFA
building_df_cleaning["has_secondary_use"] = (
    building_df_cleaning["SecondLargestPropertyUseType"].notna() |
    building_df_cleaning["SecondLargestPropertyUseTypeGFA"].notna()
).astype(int)
building_df_cleaning
| | OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | SecondLargestPropertyUseType | SecondLargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | NonResidential | Hotel | Mayflower park hotel | 405 Olive way | Seattle | WA | 98101.0 | 0659000030 | 7 | DOWNTOWN | 47.61220 | -122.33799 | 1927 | 1.0 | 12 | 88434 | 0 | 88434 | Hotel | Hotel | 88434.0 | NaN | NaN | 60.0 | 7.226362e+06 | False | Compliant | 249.98 | 0 |
| 1 | 2 | 2016 | NonResidential | Hotel | Paramount Hotel | 724 Pine street | Seattle | WA | 98101.0 | 0659000220 | 7 | DOWNTOWN | 47.61317 | -122.33393 | 1996 | 1.0 | 11 | 103566 | 15064 | 88502 | Hotel, Parking, Restaurant | Hotel | 83880.0 | Parking | 15064.0 | 61.0 | 8.387933e+06 | False | Compliant | 295.86 | 1 |
| 2 | 3 | 2016 | NonResidential | Hotel | 5673-The Westin Seattle | 1900 5th Avenue | Seattle | WA | 98101.0 | 0659000475 | 7 | DOWNTOWN | 47.61393 | -122.33810 | 1969 | 1.0 | 41 | 956110 | 196718 | 759392 | Hotel | Hotel | 756493.0 | NaN | NaN | 43.0 | 7.258702e+07 | False | Compliant | 2089.28 | 0 |
| 3 | 5 | 2016 | NonResidential | Hotel | HOTEL MAX | 620 STEWART ST | Seattle | WA | 98101.0 | 0659000640 | 7 | DOWNTOWN | 47.61412 | -122.33664 | 1926 | 1.0 | 10 | 61320 | 0 | 61320 | Hotel | Hotel | 61320.0 | NaN | NaN | 56.0 | 6.794584e+06 | False | Compliant | 286.43 | 0 |
| 4 | 8 | 2016 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 401 LENORA ST | Seattle | WA | 98121.0 | 0659000970 | 7 | DOWNTOWN | 47.61375 | -122.34047 | 1980 | 1.0 | 18 | 175580 | 62000 | 113580 | Hotel, Parking, Swimming Pool | Hotel | 123445.0 | Parking | 68009.0 | 75.0 | 1.417261e+07 | False | Compliant | 505.01 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3371 | 50222 | 2016 | Nonresidential COS | Office | Horticulture building | 1600 S Dakota St | Seattle | WA | NaN | 1624049080 | 2 | GREATER DUWAMISH | 47.56722 | -122.31154 | 1990 | 1.0 | 1 | 12294 | 0 | 12294 | Office | Office | 12294.0 | NaN | NaN | 46.0 | 8.497457e+05 | True | Error - Correct Default Data | 20.94 | 0 |
| 3372 | 50223 | 2016 | Nonresidential COS | Other | International district/Chinatown CC | 719 8th Ave S | Seattle | WA | NaN | 3558300000 | 2 | DOWNTOWN | 47.59625 | -122.32283 | 2004 | 1.0 | 1 | 16000 | 0 | 16000 | Other - Recreation | Other - Recreation | 16000.0 | NaN | NaN | NaN | 9.502762e+05 | False | Compliant | 32.17 | 0 |
| 3373 | 50224 | 2016 | Nonresidential COS | Other | Queen Anne Pool | 1920 1st Ave W | Seattle | WA | NaN | 1794501150 | 7 | MAGNOLIA / QUEEN ANNE | 47.63644 | -122.35784 | 1974 | 1.0 | 1 | 13157 | 0 | 13157 | Fitness Center/Health Club/Gym, Other - Recrea... | Other - Recreation | 7583.0 | Fitness Center/Health Club/Gym | 5574.0 | NaN | 5.765898e+06 | False | Compliant | 223.54 | 1 |
| 3374 | 50225 | 2016 | Nonresidential COS | Mixed Use Property | South Park Community Center | 8319 8th Ave S | Seattle | WA | NaN | 7883603155 | 1 | GREATER DUWAMISH | 47.52832 | -122.32431 | 1989 | 1.0 | 1 | 14101 | 0 | 14101 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 6601.0 | Fitness Center/Health Club/Gym | 6501.0 | NaN | 7.194712e+05 | False | Compliant | 22.11 | 1 |
| 3375 | 50226 | 2016 | Nonresidential COS | Mixed Use Property | Van Asselt Community Center | 2820 S Myrtle St | Seattle | WA | NaN | 7857002030 | 2 | GREATER DUWAMISH | 47.53939 | -122.29536 | 1938 | 1.0 | 1 | 18258 | 0 | 18258 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8271.0 | Fitness Center/Health Club/Gym | 8000.0 | NaN | 1.152896e+06 | False | Compliant | 41.27 | 1 |
1668 rows × 31 columns
# Columns to drop ("second use")
cols_to_drop = ["SecondLargestPropertyUseType", "SecondLargestPropertyUseTypeGFA"]
building_df_cleaning = building_df_cleaning.drop(columns=cols_to_drop, errors="ignore")
building_df_cleaning.shape
(1668, 29)
Information about buildings' third-largest uses, which is very sparsely populated in the dataset, was dropped to limit complexity and potential noise.
The presence of a second use, by contrast, was kept as a binary variable indicating whether the building is single-use or multi-use. This choice captures relevant structural information while avoiding complex handling of missing values and categories.
At this stage, the dataset contains 1,668 buildings described by 29 variables.
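The single-use vs multi-use flag could alternatively be derived from ListOfAllPropertyUseTypes by counting the listed uses. The sketch below is one way to do that in plain Python, under the assumption that uses are separated by commas and that some labels may themselves contain commas inside parentheses; the helper name `count_uses` and the parenthesized label in the example are hypothetical.

```python
import re

# Split on commas that are not followed by a closing parenthesis
# before any opening one, i.e. commas outside parentheses.
_SEPARATOR = re.compile(r",\s*(?![^()]*\))")


def count_uses(use_list):
    """Return the number of uses listed in a ListOfAllPropertyUseTypes value."""
    if not isinstance(use_list, str):
        # Missing values (None or NaN) carry no listed use.
        return 0
    return len([part for part in _SEPARATOR.split(use_list) if part.strip()])


examples = [
    "Hotel",
    "Hotel, Parking, Restaurant",
    "Repair Services (Vehicle, Shoe, Locksmith, etc), Office",  # hypothetical label
    None,
]

counts = [count_uses(value) for value in examples]
is_multi_use = [int(n > 1) for n in counts]
print(counts)
print(is_multi_use)
```

A flag built this way may not match has_secondary_use on every row, so comparing the two on the real data would be a reasonable check before choosing one.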
# Handle missing ENERGYSTARScore values by median imputation (about 25% NaN in the full dataset, i.e. a quarter of the rows)
median_energystar = building_df_cleaning["ENERGYSTARScore"].median()
building_df_cleaning["ENERGYSTARScore"] = building_df_cleaning["ENERGYSTARScore"].fillna(median_energystar)
building_df_cleaning
| | OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | NonResidential | Hotel | Mayflower park hotel | 405 Olive way | Seattle | WA | 98101.0 | 0659000030 | 7 | DOWNTOWN | 47.61220 | -122.33799 | 1927 | 1.0 | 12 | 88434 | 0 | 88434 | Hotel | Hotel | 88434.0 | 60.0 | 7.226362e+06 | False | Compliant | 249.98 | 0 |
| 1 | 2 | 2016 | NonResidential | Hotel | Paramount Hotel | 724 Pine street | Seattle | WA | 98101.0 | 0659000220 | 7 | DOWNTOWN | 47.61317 | -122.33393 | 1996 | 1.0 | 11 | 103566 | 15064 | 88502 | Hotel, Parking, Restaurant | Hotel | 83880.0 | 61.0 | 8.387933e+06 | False | Compliant | 295.86 | 1 |
| 2 | 3 | 2016 | NonResidential | Hotel | 5673-The Westin Seattle | 1900 5th Avenue | Seattle | WA | 98101.0 | 0659000475 | 7 | DOWNTOWN | 47.61393 | -122.33810 | 1969 | 1.0 | 41 | 956110 | 196718 | 759392 | Hotel | Hotel | 756493.0 | 43.0 | 7.258702e+07 | False | Compliant | 2089.28 | 0 |
| 3 | 5 | 2016 | NonResidential | Hotel | HOTEL MAX | 620 STEWART ST | Seattle | WA | 98101.0 | 0659000640 | 7 | DOWNTOWN | 47.61412 | -122.33664 | 1926 | 1.0 | 10 | 61320 | 0 | 61320 | Hotel | Hotel | 61320.0 | 56.0 | 6.794584e+06 | False | Compliant | 286.43 | 0 |
| 4 | 8 | 2016 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 401 LENORA ST | Seattle | WA | 98121.0 | 0659000970 | 7 | DOWNTOWN | 47.61375 | -122.34047 | 1980 | 1.0 | 18 | 175580 | 62000 | 113580 | Hotel, Parking, Swimming Pool | Hotel | 123445.0 | 75.0 | 1.417261e+07 | False | Compliant | 505.01 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3371 | 50222 | 2016 | Nonresidential COS | Office | Horticulture building | 1600 S Dakota St | Seattle | WA | NaN | 1624049080 | 2 | GREATER DUWAMISH | 47.56722 | -122.31154 | 1990 | 1.0 | 1 | 12294 | 0 | 12294 | Office | Office | 12294.0 | 46.0 | 8.497457e+05 | True | Error - Correct Default Data | 20.94 | 0 |
| 3372 | 50223 | 2016 | Nonresidential COS | Other | International district/Chinatown CC | 719 8th Ave S | Seattle | WA | NaN | 3558300000 | 2 | DOWNTOWN | 47.59625 | -122.32283 | 2004 | 1.0 | 1 | 16000 | 0 | 16000 | Other - Recreation | Other - Recreation | 16000.0 | 73.0 | 9.502762e+05 | False | Compliant | 32.17 | 0 |
| 3373 | 50224 | 2016 | Nonresidential COS | Other | Queen Anne Pool | 1920 1st Ave W | Seattle | WA | NaN | 1794501150 | 7 | MAGNOLIA / QUEEN ANNE | 47.63644 | -122.35784 | 1974 | 1.0 | 1 | 13157 | 0 | 13157 | Fitness Center/Health Club/Gym, Other - Recrea... | Other - Recreation | 7583.0 | 73.0 | 5.765898e+06 | False | Compliant | 223.54 | 1 |
| 3374 | 50225 | 2016 | Nonresidential COS | Mixed Use Property | South Park Community Center | 8319 8th Ave S | Seattle | WA | NaN | 7883603155 | 1 | GREATER DUWAMISH | 47.52832 | -122.32431 | 1989 | 1.0 | 1 | 14101 | 0 | 14101 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 6601.0 | 73.0 | 7.194712e+05 | False | Compliant | 22.11 | 1 |
| 3375 | 50226 | 2016 | Nonresidential COS | Mixed Use Property | Van Asselt Community Center | 2820 S Myrtle St | Seattle | WA | NaN | 7857002030 | 2 | GREATER DUWAMISH | 47.53939 | -122.29536 | 1938 | 1.0 | 1 | 18258 | 0 | 18258 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8271.0 | 73.0 | 1.152896e+06 | False | Compliant | 41.27 | 1 |
1668 rows × 29 columns
The ENERGYSTARScore variable has about 25% missing values.
They were imputed with the median (73.0 here) so that all observations are kept.
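Median imputation erases the information about which scores were originally missing. As a possible variant (not part of the original notebook), the sketch below records that information in a flag column before filling. The column name ENERGYSTARScore_imputed is hypothetical and the data are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic scores with two missing entries
scores = pd.DataFrame({"ENERGYSTARScore": [60.0, np.nan, 43.0, 75.0, np.nan]})

# Hypothetical flag column: 1 where the score was missing before imputation
scores["ENERGYSTARScore_imputed"] = scores["ENERGYSTARScore"].isna().astype(int)

# Series.median skips NaN by default, so the median is computed on observed values only
median_score = scores["ENERGYSTARScore"].median()
scores["ENERGYSTARScore"] = scores["ENERGYSTARScore"].fillna(median_score)

print(scores)
```

With roughly a quarter of the values imputed, such a flag lets a later model distinguish measured scores from filled-in ones.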
# Display the rows where SiteEnergyUse(kBtu) is missing
building_df_cleaning[building_df_cleaning["SiteEnergyUse(kBtu)"].isna()]
| OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 578 | 773 | 2016 | NonResidential | Small- and Mid-Sized Office | SEATTLE BUILDING | 215 COLUMBIA ST | Seattle | WA | 98104.0 | 0939000245 | 7 | DOWNTOWN | 47.60380 | -122.33293 | 1924 | NaN | 4 | 63150 | 0 | 63150 | NaN | NaN | NaN | 73.0 | NaN | False | Non-Compliant | NaN | 0 |
| 2670 | 26532 | 2016 | NonResidential | Mixed Use Property | KALBERG BUILDING | 4515 UNIVERSITY WAY NE | Seattle | WA | 98105.0 | 8816401120 | 4 | NORTHEAST | 47.66182 | -122.31345 | 1928 | NaN | 2 | 20760 | 0 | 20760 | NaN | NaN | NaN | 73.0 | NaN | False | Non-Compliant | NaN | 0 |
# Drop these 2 rows
building_df_cleaning = building_df_cleaning.dropna(
    subset=["SiteEnergyUse(kBtu)", "TotalGHGEmissions"]
)
building_df_cleaning.shape
(1666, 29)
Two buildings with missing values for the target variables SiteEnergyUse(kBtu) and TotalGHGEmissions were examined.
These rows carry no usable information (their use-type fields are also missing) and represent a negligible share of the dataset.
They were therefore removed to safeguard the quality of the data used for modelling.
At this stage, the dataset contains 1,666 observations and 29 variables.
# Remaining missing values:
missing_building_df_cleaning = (building_df_cleaning.isna().mean()*100).sort_values(ascending=False)
missing_building_df_cleaning = missing_building_df_cleaning.to_frame("missing_%")
missing_building_df_cleaning[missing_building_df_cleaning["missing_%"]>0]
| missing_% | |
|---|---|
| ZipCode | 0.960384 |
| LargestPropertyUseType | 0.240096 |
| LargestPropertyUseTypeGFA | 0.240096 |
# Display the rows where LargestPropertyUseType is missing
building_df_cleaning[building_df_cleaning["LargestPropertyUseType"].isna()]
| OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 353 | 496 | 2016 | NonResidential | Self-Storage Facility | Market St Center | 2811 NW Market Street | Seattle | WA | 98107.0 | 1175001235 | 6 | BALLARD | 47.66838 | -122.39310 | 1946 | 2.0 | 2 | 111445 | 0 | 111445 | Fitness Center/Health Club/Gym, Office, Other ... | NaN | NaN | 73.0 | 5.697472e+06 | False | Compliant | 163.83 | 0 |
| 1147 | 21103 | 2016 | NonResidential | Hotel | Palladian Hotel | 2000 Second Avenue | Seattle | WA | 98121.0 | 1977201140 | 7 | DOWNTOWN | 47.61203 | -122.34165 | 1910 | 1.0 | 8 | 61721 | 0 | 61721 | Hotel | NaN | NaN | 93.0 | 2.897080e+06 | False | Compliant | 36.92 | 0 |
| 2414 | 25568 | 2016 | NonResidential | Small- and Mid-Sized Office | Talon Northlake LLC | 1341 N Northlake Way | Seattle | WA | 98103.0 | 4088804565 | 4 | LAKE UNION | 47.64747 | -122.34086 | 2008 | 1.0 | 4 | 48350 | 0 | 48350 | Office | NaN | NaN | 45.0 | 3.168131e+06 | False | Compliant | 22.09 | 0 |
| 2459 | 25711 | 2016 | NonResidential | Restaurant | BUSH GARDEN - RESTURANT & LOUNGE | 614 S MAYNARD AVE S | Seattle | WA | 98104.0 | 5247802410 | 2 | DOWNTOWN | 47.59697 | -122.32474 | 1913 | 1.0 | 3 | 28800 | 0 | 28800 | Restaurant | NaN | NaN | 73.0 | 8.999242e+05 | False | Compliant | 29.21 | 0 |
# Drop these 4 rows
building_df_cleaning = building_df_cleaning.dropna(
    subset=["LargestPropertyUseTypeGFA"]
)
building_df_cleaning.shape
(1662, 29)
Missing values in LargestPropertyUseType and LargestPropertyUseTypeGFA concern only 4 buildings.
The main use could be inferred from ListOfAllPropertyUseTypes, so a targeted manual imputation would be possible.
However, for these 4 buildings the floor area associated with the main use cannot be estimated reliably.
These observations were therefore removed to ensure the consistency and quality of the data used for modelling.
At this stage, the dataset contains 1,662 buildings described by 29 variables.
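For reference, the targeted imputation that was considered but not applied could look like the sketch below. It fills LargestPropertyUseType only for rows whose ListOfAllPropertyUseTypes holds a single entry. Treating "no comma" as "single use" is an assumption made for illustration, and it is fragile because some category labels themselves contain commas (for example "Repair Services (Vehicle, Shoe, Locksmith, etc)"). The data are synthetic.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ListOfAllPropertyUseTypes": [
        "Hotel",
        "Office",
        "Fitness Center/Health Club/Gym, Office",
        "Restaurant",
    ],
    "LargestPropertyUseType": [np.nan, np.nan, np.nan, "Restaurant"],
})

# Assumed convention: a list without a comma describes a single-use building
single_use = ~toy["ListOfAllPropertyUseTypes"].str.contains(",", regex=False)
to_fill = toy["LargestPropertyUseType"].isna() & single_use

# Copy the single listed use into the missing main-use field
toy.loc[to_fill, "LargestPropertyUseType"] = toy.loc[to_fill, "ListOfAllPropertyUseTypes"]

print(toy)
```

The multi-use row stays missing, and LargestPropertyUseTypeGFA would still be unknown for every filled row, which is the reason given above for dropping these observations instead.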
# Display the rows where ZipCode is missing
building_df_cleaning[building_df_cleaning["ZipCode"].isna()]
| OSEBuildingID | DataYear | BuildingType | PrimaryPropertyType | PropertyName | Address | City | State | ZipCode | TaxParcelIdentificationNumber | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3360 | 50196 | 2016 | Nonresidential COS | Mixed Use Property | Northgate Community Center | 10510 5th Ave NE | Seattle | WA | NaN | 2926049431 | 5 | NORTH | 47.70541 | -122.32232 | 2005 | 1.0 | 1 | 20616 | 0 | 20616 | Fitness Center/Health Club/Gym, Office, Other ... | Other - Recreation | 9900.0 | 73.0 | 6.369655e+05 | False | Compliant | 4.44 | 1 |
| 3361 | 50198 | 2016 | Nonresidential COS | Other | Fire Station 06 (New) | 405 MLK Jr Way S | Seattle | WA | NaN | 1250200565 | 3 | CENTRAL | 47.59905 | -122.29787 | 2013 | 1.0 | 1 | 11685 | 0 | 11685 | Prison/Incarceration | Prison/Incarceration | 11685.0 | 73.0 | 8.510538e+05 | False | Compliant | 29.18 | 0 |
| 3362 | 50201 | 2016 | Nonresidential COS | Other | Fire Station 35 (New) | 8729 15th Ave NW | Seattle | WA | NaN | 3300700810 | 6 | BALLARD | 47.69330 | -122.37717 | 2010 | 1.0 | 1 | 11968 | 0 | 11968 | Prison/Incarceration | Prison/Incarceration | 11968.0 | 73.0 | 7.834230e+05 | False | Compliant | 23.00 | 0 |
| 3363 | 50204 | 2016 | Nonresidential COS | Other | Fire Station 39 (New) | 2806 NE 127th St | Seattle | WA | NaN | 3834500066 | 5 | NORTH | 47.72126 | -122.29735 | 1949 | 1.0 | 1 | 11285 | 0 | 11285 | Prison/Incarceration | Prison/Incarceration | 11285.0 | 73.0 | 6.456654e+05 | False | Compliant | 14.37 | 0 |
| 3364 | 50207 | 2016 | Nonresidential COS | Other | Ballard Community Center | 6020 28th ave NW | Seattle | WA | NaN | 6658000065 | 6 | BALLARD | 47.67295 | -122.39228 | 1911 | 1.0 | 1 | 16795 | 0 | 16795 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8680.0 | 73.0 | 9.366165e+05 | False | Compliant | 24.73 | 1 |
| 3365 | 50208 | 2016 | Nonresidential COS | Other | Ballard Pool | 1471 NW 67th St | Seattle | WA | NaN | 3050700005 | 6 | BALLARD | 47.67734 | -122.37624 | 1972 | 1.0 | 1 | 12769 | 0 | 12769 | Fitness Center/Health Club/Gym, Office, Other ... | Other - Recreation | 10912.0 | 73.0 | 5.117308e+06 | False | Compliant | 216.18 | 1 |
| 3366 | 50210 | 2016 | Nonresidential COS | Office | Central West HQ / Brown Bear | 1403 w howe | Seattle | WA | NaN | 2425039137 | 7 | MAGNOLIA / QUEEN ANNE | 47.63572 | -122.37525 | 1952 | 1.0 | 1 | 13661 | 0 | 13661 | Office | Office | 13661.0 | 75.0 | 5.026677e+05 | True | Error - Correct Default Data | 3.50 | 0 |
| 3367 | 50212 | 2016 | Nonresidential COS | Other | Conservatory Campus | 1400 E Galer St | Seattle | WA | NaN | 2925049087 | 3 | EAST | 47.63228 | -122.31574 | 1912 | 1.0 | 1 | 23445 | 0 | 23445 | Other - Recreation | Other - Recreation | 23445.0 | 73.0 | 5.976246e+06 | False | Compliant | 259.22 | 0 |
| 3368 | 50219 | 2016 | Nonresidential COS | Mixed Use Property | Garfield Community Center | 2323 East Cherry St | Seattle | WA | NaN | 7544800245 | 3 | CENTRAL | 47.60775 | -122.30225 | 1994 | 1.0 | 1 | 20050 | 0 | 20050 | Fitness Center/Health Club/Gym, Office, Other ... | Other - Recreation | 8108.0 | 73.0 | 1.813404e+06 | False | Compliant | 60.81 | 1 |
| 3369 | 50220 | 2016 | Nonresidential COS | Office | Genesee/SC SE HQ | 4420 S Genesee | Seattle | WA | NaN | 4154300585 | 2 | SOUTHEAST | 47.56440 | -122.27813 | 1960 | 1.0 | 1 | 15398 | 0 | 15398 | Office | Office | 15398.0 | 93.0 | 3.878100e+05 | True | Error - Correct Default Data | 7.79 | 0 |
| 3370 | 50221 | 2016 | Nonresidential COS | Other | High Point Community Center | 6920 34th Ave SW | Seattle | WA | NaN | 2524039059 | 1 | DELRIDGE NEIGHBORHOODS | 47.54067 | -122.37441 | 1982 | 1.0 | 1 | 18261 | 0 | 18261 | Other - Recreation | Other - Recreation | 18261.0 | 73.0 | 9.320821e+05 | False | Compliant | 20.33 | 0 |
| 3371 | 50222 | 2016 | Nonresidential COS | Office | Horticulture building | 1600 S Dakota St | Seattle | WA | NaN | 1624049080 | 2 | GREATER DUWAMISH | 47.56722 | -122.31154 | 1990 | 1.0 | 1 | 12294 | 0 | 12294 | Office | Office | 12294.0 | 46.0 | 8.497457e+05 | True | Error - Correct Default Data | 20.94 | 0 |
| 3372 | 50223 | 2016 | Nonresidential COS | Other | International district/Chinatown CC | 719 8th Ave S | Seattle | WA | NaN | 3558300000 | 2 | DOWNTOWN | 47.59625 | -122.32283 | 2004 | 1.0 | 1 | 16000 | 0 | 16000 | Other - Recreation | Other - Recreation | 16000.0 | 73.0 | 9.502762e+05 | False | Compliant | 32.17 | 0 |
| 3373 | 50224 | 2016 | Nonresidential COS | Other | Queen Anne Pool | 1920 1st Ave W | Seattle | WA | NaN | 1794501150 | 7 | MAGNOLIA / QUEEN ANNE | 47.63644 | -122.35784 | 1974 | 1.0 | 1 | 13157 | 0 | 13157 | Fitness Center/Health Club/Gym, Other - Recrea... | Other - Recreation | 7583.0 | 73.0 | 5.765898e+06 | False | Compliant | 223.54 | 1 |
| 3374 | 50225 | 2016 | Nonresidential COS | Mixed Use Property | South Park Community Center | 8319 8th Ave S | Seattle | WA | NaN | 7883603155 | 1 | GREATER DUWAMISH | 47.52832 | -122.32431 | 1989 | 1.0 | 1 | 14101 | 0 | 14101 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 6601.0 | 73.0 | 7.194712e+05 | False | Compliant | 22.11 | 1 |
| 3375 | 50226 | 2016 | Nonresidential COS | Mixed Use Property | Van Asselt Community Center | 2820 S Myrtle St | Seattle | WA | NaN | 7857002030 | 2 | GREATER DUWAMISH | 47.53939 | -122.29536 | 1938 | 1.0 | 1 | 18258 | 0 | 18258 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8271.0 | 73.0 | 1.152896e+06 | False | Compliant | 41.27 | 1 |
# Quick check (100%: Seattle + WA)
building_df_cleaning["City"].value_counts()
building_df_cleaning["State"].value_counts()
building_df_cleaning["DataYear"].value_counts()
DataYear
2016    1662
Name: count, dtype: int64
# Drop columns that are not relevant:
cols_to_drop = ["ZipCode", "Address", "City", "State", "DataYear"]
building_df_cleaning = building_df_cleaning.drop(columns=cols_to_drop)
building_df_cleaning.shape
(1662, 24)
The ZipCode variable has few missing values, but it adds no further discriminating information in a single-city context, especially since geographic coordinates are available. It was therefore removed.
The same applies to the Address variable.
The City, State and DataYear variables show no variability (Seattle, WA, 2016 for every observation). They therefore carry no useful information for analysis or modelling and were removed.
At this stage, the dataset contains 1,662 buildings described by 24 variables.
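Columns without variability can also be detected programmatically rather than by inspecting value_counts() one column at a time. A minimal sketch on a synthetic frame (the values are illustrative, only the column names come from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["Seattle", "Seattle", "Seattle"],
    "State": ["WA", "WA", "WA"],
    "DataYear": [2016, 2016, 2016],
    "YearBuilt": [1927, 1996, 1969],
})

# nunique() ignores NaN by default; a column with at most one distinct value is constant
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]

print(constant_cols)
toy = toy.drop(columns=constant_cols)
```

This only catches strictly constant columns. Variables such as ZipCode or Address still require the kind of judgment call described above.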
Semantic cleaning and data quality checks¶
PrimaryPropertyType vs LargestPropertyUseType
pd.crosstab(
building_df_cleaning["LargestPropertyUseType"],
building_df_cleaning["PrimaryPropertyType"],
)
| PrimaryPropertyType | Distribution Center | Hospital | Hotel | K-12 School | Laboratory | Large Office | Low-Rise Multifamily | Medical Office | Mixed Use Property | Office | Other | Refrigerated Warehouse | Residence Hall | Restaurant | Retail Store | Self-Storage Facility | Senior Care Community | Small- and Mid-Sized Office | Supermarket / Grocery Store | University | Warehouse | Worship Facility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LargestPropertyUseType | ||||||||||||||||||||||
| Adult Education | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Automobile Dealership | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Bank Branch | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| College/University | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 0 | 0 |
| Convention Center | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Courthouse | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Data Center | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Distribution Center | 53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Financial Office | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Fire Station | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Fitness Center/Health Club/Gym | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Food Service | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hospital (General Medical & Surgical) | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hotel | 0 | 0 | 75 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| K-12 School | 0 | 0 | 0 | 139 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Laboratory | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Library | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Lifestyle Center | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Manufacturing/Industrial Plant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Medical Office | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Movie Theater | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multifamily Housing | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Museum | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Non-Refrigerated Warehouse | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 187 | 0 |
| Office | 0 | 0 | 0 | 0 | 0 | 173 | 0 | 0 | 31 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 290 | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 88 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Education | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Entertainment/Public Assembly | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Lodging/Residential | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Mall | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Public Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Recreation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Restaurant/Bar | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other - Utility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Other/Specialty Hospital | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Parking | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Performing Arts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Personal Services (Health/Beauty, Dry Cleaning, etc) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Police Station | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pre-school/Daycare | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Prison/Incarceration | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Refrigerated Warehouse | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Repair Services (Vehicle, Shoe, Locksmith, etc) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Residence Hall/Dormitory | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Residential Care Facility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Restaurant | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Retail Store | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Self-Storage Facility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 |
| Senior Care Community | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 |
| Social/Meeting Hall | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Strip Mall | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Supermarket/Grocery Store | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 0 | 0 | 0 |
| Urgent Care/Clinic/Other Outpatient | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wholesale Club/Supercenter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Worship Facility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 71 |
Two variables describe building use: PrimaryPropertyType (more aggregated) and LargestPropertyUseType (more granular).
So as not to discard potentially relevant information too early, both variables are kept at this stage and analysed during the EDA.
The final choice of the variable used for modelling will be made after the exploratory analysis.
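One way to summarise how the two variables overlap is to normalise the contingency table by row and look at the largest share in each row. A value of 1.0 means a granular category always falls into the same aggregated category. The sketch below uses a small synthetic sample. The name dominant_share is introduced here for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "LargestPropertyUseType": ["Office", "Office", "Office", "Office", "Hotel", "Hotel"],
    "PrimaryPropertyType": [
        "Large Office",
        "Small- and Mid-Sized Office",
        "Small- and Mid-Sized Office",
        "Mixed Use Property",
        "Hotel",
        "Hotel",
    ],
})

# Row-normalised contingency table: each row sums to 1
shares = pd.crosstab(
    toy["LargestPropertyUseType"],
    toy["PrimaryPropertyType"],
    normalize="index",
)

# Largest share per granular category
dominant_share = shares.max(axis=1)
print(dominant_share)
```

In the table above, "Office" is the clearest case where the aggregated variable adds information, since it splits into Large Office, Small- and Mid-Sized Office, Mixed Use Property and Office.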
# Drop TaxParcelIdentificationNumber
building_df_cleaning = building_df_cleaning.drop(columns=["TaxParcelIdentificationNumber"], errors="ignore")
building_df_cleaning.shape
(1662, 23)
TaxParcelIdentificationNumber was removed: it is a tax identifier with no link to our objective.
At this stage, the dataset contains 1,662 buildings described by 23 columns.
Data quality assessment:
building_df_cleaning["DefaultData"].value_counts()
DefaultData
False    1574
True       88
Name: count, dtype: int64
building_df_cleaning["ComplianceStatus"].value_counts()
ComplianceStatus
Compliant                       1544
Error - Correct Default Data      88
Non-Compliant                     16
Missing Data                      14
Name: count, dtype: int64
# Display the rows where ComplianceStatus is "Non-Compliant"
building_df_cleaning[building_df_cleaning["ComplianceStatus"]=='Non-Compliant']
| OSEBuildingID | BuildingType | PrimaryPropertyType | PropertyName | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | DefaultData | ComplianceStatus | TotalGHGEmissions | has_secondary_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 226 | 350 | NonResidential | Large Office | Second And Spring Building | 7 | DOWNTOWN | 47.60642 | -122.33581 | 1958 | 1.0 | 6 | 172842 | 25920 | 146922 | Data Center, Office, Parking | Office | 99890.0 | 73.0 | 4.139950e+07 | False | Non-Compliant | 362.66 | 1 |
| 304 | 435 | NonResidential | Other | Washington State Convention Center | 7 | DOWNTOWN | 47.61195 | -122.33167 | 1990 | 1.0 | 6 | 1400000 | 0 | 1400000 | Convention Center, Parking | Convention Center | 1072000.0 | 73.0 | 0.000000e+00 | False | Non-Compliant | 0.00 | 1 |
| 384 | 539 | NonResidential | Retail Store | University Center | 4 | NORTHEAST | 47.66178 | -122.31812 | 1987 | 1.0 | 2 | 69492 | 0 | 69492 | Retail Store | Retail Store | 69800.0 | 100.0 | 3.189628e+05 | False | Non-Compliant | 2.22 | 1 |
| 448 | 608 | NonResidential | Large Office | 411 1ST AVE S (ID608) | 2 | DOWNTOWN | 47.59878 | -122.33458 | 1913 | 5.0 | 7 | 154159 | 0 | 154159 | Office | Office | 193154.0 | 100.0 | 1.119592e+07 | False | Non-Compliant | 29.43 | 0 |
| 517 | 704 | NonResidential | Large Office | 401 Elliott Ave West | 7 | MAGNOLIA / QUEEN ANNE | 47.62235 | -122.36378 | 2000 | 1.0 | 4 | 129551 | 42500 | 87051 | Data Center, Office, Parking | Office | 82273.0 | 1.0 | 2.713719e+07 | False | Non-Compliant | 189.18 | 1 |
| 1229 | 21315 | NonResidential | Small- and Mid-Sized Office | 1518 Fifith Ave | 7 | DOWNTOWN | 47.61119 | -122.33581 | 1903 | 1.0 | 3 | 57720 | 0 | 57720 | Office | Office | 25000.0 | 73.0 | 2.410550e+04 | False | Non-Compliant | 0.17 | 0 |
| 1295 | 21474 | NonResidential | Other | The Lusty Lady | 7 | DOWNTOWN | 47.60711 | -122.33886 | 1900 | 1.0 | 3 | 49760 | 0 | 49760 | Other | Other | 24019.0 | 73.0 | 4.429350e+04 | False | Non-Compliant | 0.31 | 0 |
| 1611 | 22830 | NonResidential | Worship Facility | Freedom Church | 1 | SOUTHWEST | 47.51709 | -122.37797 | 1971 | 1.0 | 1 | 23772 | 0 | 23772 | Worship Facility | Worship Facility | 23772.0 | 100.0 | 1.008417e+05 | False | Non-Compliant | 0.70 | 0 |
| 1945 | 23912 | NonResidential | Small- and Mid-Sized Office | 1416 S Jackson | 3 | CENTRAL | 47.59973 | -122.31331 | 1947 | 1.0 | 1 | 45068 | 0 | 45068 | Office | Office | 45068.0 | 100.0 | 2.848573e+05 | False | Non-Compliant | 8.59 | 0 |
| 2129 | 24547 | NonResidential | K-12 School | Islamic School of Seattle | 3 | CENTRAL | 47.60885 | -122.29990 | 1929 | 1.0 | 2 | 24152 | 0 | 24152 | K-12 School | K-12 School | 24152.0 | 100.0 | 1.613634e+05 | False | Non-Compliant | 1.12 | 0 |
| 2189 | 24717 | NonResidential | Other | 1701 First Ave South LLC | 2 | GREATER DUWAMISH | 47.58788 | -122.33458 | 1910 | 1.0 | 3 | 27690 | 0 | 27690 | Other, Parking | Other | 24717.0 | 73.0 | 1.680890e+04 | False | Non-Compliant | 0.12 | 1 |
| 2216 | 24825 | NonResidential | Small- and Mid-Sized Office | 2233 Building | 2 | GREATER DUWAMISH | 47.58292 | -122.33468 | 1910 | 1.0 | 2 | 20970 | 0 | 20970 | Office, Parking | Office | 20970.0 | 100.0 | 2.044991e+05 | False | Non-Compliant | 5.43 | 1 |
| 2410 | 25553 | NonResidential | Hotel | J & M HOTEL BUILDING (ID25553) | 7 | DOWNTOWN | 47.60035 | -122.33379 | 1900 | 1.0 | 3 | 25450 | 0 | 25450 | Hotel | Hotel | 25450.0 | 99.0 | 5.037447e+05 | False | Non-Compliant | 3.51 | 0 |
| 2450 | 25674 | NonResidential | Low-Rise Multifamily | (ID25674) COMET TAVERN | 3 | EAST | 47.61427 | -122.31977 | 1910 | 1.0 | 3 | 32100 | 0 | 32100 | Bar/Nightclub, Multifamily Housing | Multifamily Housing | 21400.0 | 73.0 | 1.082004e+05 | False | Non-Compliant | 5.22 | 1 |
| 2801 | 27007 | NonResidential | Worship Facility | Seattle Community Church | 4 | NORTHEAST | 47.66146 | -122.27880 | 1954 | 1.0 | 2 | 20039 | 0 | 20039 | Worship Facility | Worship Facility | 20039.0 | 100.0 | 1.047223e+05 | False | Non-Compliant | 0.73 | 0 |
| 3152 | 43948 | Nonresidential COS | Other | Georgetown Steamplant | 2 | GREATER DUWAMISH | 47.54277 | -122.31626 | 1906 | 1.0 | 2 | 39212 | 0 | 39212 | Other | Other | 39212.0 | 73.0 | 7.237040e+04 | False | Non-Compliant | 0.50 | 0 |
# Drop the "Non-Compliant" buildings (.copy() avoids chained-assignment warnings on later column updates)
building_df_cleaning = building_df_cleaning[
    building_df_cleaning["ComplianceStatus"] != "Non-Compliant"
].copy()
building_df_cleaning["ComplianceStatus"].value_counts()
ComplianceStatus
Compliant                       1544
Error - Correct Default Data      88
Missing Data                      14
Name: count, dtype: int64
# Drop the columns that only describe data quality:
cols_to_drop = ["DefaultData", "ComplianceStatus"]
building_df_cleaning = building_df_cleaning.drop(columns=cols_to_drop)
building_df_cleaning.shape
(1646, 21)
The DefaultData variable shows that 88 observations contain default (non-measured) values. ComplianceStatus further shows that these cases correspond to the "Error - Correct Default Data" category, interpreted here as meaning that the errors were corrected.
Most buildings (1,544) are compliant (Compliant), which confirms the good overall quality of the dataset.
The 14 observations labelled "Missing Data" were covered by the missing-value handling steps and receive no further treatment. Note that they are still present in the dataset (1,544 + 88 + 14 = 1,646 rows after filtering).
By contrast, 16 buildings are flagged as "Non-Compliant". These buildings do not meet the applicable regulation, so their data may be legally or technically questionable and therefore potentially unreliable. Given how few they are and the noise they could introduce into the analysis and modelling, they were removed from the dataset.
The DefaultData and ComplianceStatus variables, used only to assess data quality and filter out unreliable observations, were removed from the final dataset.
At this stage, the dataset contains 1,646 buildings described by 21 variables.
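The row counts reported in this step can be cross-checked with a few lines of arithmetic. The sketch below re-enters the ComplianceStatus counts printed above into a Series (so it runs without the dataset) and verifies the totals before and after the "Non-Compliant" filter.

```python
import pandas as pd

# Counts copied from the ComplianceStatus value_counts() output above
status_counts = pd.Series({
    "Compliant": 1544,
    "Error - Correct Default Data": 88,
    "Non-Compliant": 16,
    "Missing Data": 14,
})

n_before = int(status_counts.sum())                        # rows before filtering
n_after = int(status_counts.drop("Non-Compliant").sum())   # rows after filtering

print(n_before, n_after)
```

In the notebook itself, the same check could be written as an assertion on building_df_cleaning.shape[0] right after the filter.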
building_df_cleaning.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OSEBuildingID | 1646.0 | NaN | NaN | NaN | 16318.59356 | 13851.959964 | 1.0 | 581.25 | 21139.0 | 24598.75 | 50226.0 |
| BuildingType | 1646 | 5 | NonResidential | 1439 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PrimaryPropertyType | 1646 | 22 | Small- and Mid-Sized Office | 287 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| PropertyName | 1646 | 1642 | South Park | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CouncilDistrictCode | 1646.0 | NaN | NaN | NaN | 4.355407 | 2.191318 | 1.0 | 2.0 | 4.0 | 7.0 | 7.0 |
| Neighborhood | 1646 | 19 | DOWNTOWN | 352 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Latitude | 1646.0 | NaN | NaN | NaN | 47.616132 | 0.048309 | 47.49917 | 47.58528 | 47.61248 | 47.649757 | 47.73387 |
| Longitude | 1646.0 | NaN | NaN | NaN | -122.332913 | 0.024592 | -122.41182 | -122.343365 | -122.332895 | -122.321742 | -122.25864 |
| YearBuilt | 1646.0 | NaN | NaN | NaN | 1962.244836 | 32.595451 | 1900.0 | 1930.0 | 1966.0 | 1989.0 | 2015.0 |
| NumberofBuildings | 1646.0 | NaN | NaN | NaN | 1.167679 | 2.947537 | 0.0 | 1.0 | 1.0 | 1.0 | 111.0 |
| NumberofFloors | 1646.0 | NaN | NaN | NaN | 4.131835 | 6.603048 | 0.0 | 1.0 | 2.0 | 4.0 | 99.0 |
| PropertyGFATotal | 1646.0 | NaN | NaN | NaN | 118835.638518 | 297552.218867 | 11285.0 | 29548.5 | 49489.5 | 105775.0 | 9320156.0 |
| PropertyGFAParking | 1646.0 | NaN | NaN | NaN | 13028.802552 | 42524.812642 | 0.0 | 0.0 | 0.0 | 0.0 | 512608.0 |
| PropertyGFABuilding(s) | 1646.0 | NaN | NaN | NaN | 105806.835966 | 284229.426942 | 3636.0 | 28507.75 | 47391.5 | 94936.0 | 9320156.0 |
| ListOfAllPropertyUseTypes | 1646 | 370 | K-12 School | 134 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LargestPropertyUseType | 1646 | 55 | Office | 491 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LargestPropertyUseTypeGFA | 1646.0 | NaN | NaN | NaN | 98624.53949 | 276941.773441 | 5656.0 | 25627.25 | 43960.0 | 92004.5 | 9320156.0 |
| ENERGYSTARScore | 1646.0 | NaN | NaN | NaN | 67.899757 | 23.322119 | 1.0 | 62.0 | 73.0 | 81.0 | 100.0 |
| SiteEnergyUse(kBtu) | 1646.0 | NaN | NaN | NaN | 8483145.048846 | 30402484.655871 | 0.0 | 1248602.46875 | 2572872.875 | 6950048.0 | 873923712.0 |
| TotalGHGEmissions | 1646.0 | NaN | NaN | NaN | 186.697193 | 756.305156 | -0.8 | 20.3625 | 50.015 | 144.1425 | 16870.98 |
| has_secondary_use | 1646.0 | NaN | NaN | NaN | 0.515188 | 0.499921 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
EDA: Exploratory Data Analysis¶
Univariate EDA: Analysis of the Target Variables¶
# Descriptive statistics
building_df_cleaning[["SiteEnergyUse(kBtu)", "TotalGHGEmissions"]].describe()
| | SiteEnergyUse(kBtu) | TotalGHGEmissions |
|---|---|---|
| count | 1.646000e+03 | 1646.000000 |
| mean | 8.483145e+06 | 186.697193 |
| std | 3.040248e+07 | 756.305156 |
| min | 0.000000e+00 | -0.800000 |
| 25% | 1.248602e+06 | 20.362500 |
| 50% | 2.572873e+06 | 50.015000 |
| 75% | 6.950048e+06 | 144.142500 |
| max | 8.739237e+08 | 16870.980000 |
building_df_cleaning["TotalGHGEmissions"]=(
building_df_cleaning["TotalGHGEmissions"].clip(lower=0)
)
building_df_cleaning["TotalGHGEmissions"].describe()
count     1646.000000
mean       186.697679
std        756.305035
min          0.000000
25%         20.362500
50%         50.015000
75%        144.142500
max      16870.980000
Name: TotalGHGEmissions, dtype: float64
TotalGHGEmissions has a slightly negative minimum (-0.8), which has no physical interpretation.
This value is treated as a measurement artifact and is clipped to 0 so that the data remain physically consistent.
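As a minimal sketch of the correction described above, the snippet below counts the negative entries before clipping, so that the size of the fix is documented. The `toy` DataFrame and its values are hypothetical and only stand in for `building_df_cleaning`.

```python
import pandas as pd

# Toy emissions column (hypothetical values) with one slightly negative entry.
toy = pd.DataFrame({"TotalGHGEmissions": [-0.8, 0.0, 20.4, 50.0, 144.1]})

# Count the negative values first, so the correction can be reported.
n_negative = int((toy["TotalGHGEmissions"] < 0).sum())

# clip(lower=0) replaces every value below 0 with 0 and leaves the rest untouched.
toy["TotalGHGEmissions"] = toy["TotalGHGEmissions"].clip(lower=0)

print(f"{n_negative} negative value(s) clipped to 0")
print(toy["TotalGHGEmissions"].min())
```

Clipping keeps the row in the dataset, which seems reasonable when a single value is only marginally below zero; dropping the row would be the alternative if the whole record looked unreliable.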
# Histograms
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.histplot(building_df_cleaning["SiteEnergyUse(kBtu)"], bins=50, ax=axes[0])
axes[0].set_title("Distribution of SiteEnergyUse (kBtu)")
sns.histplot(building_df_cleaning["TotalGHGEmissions"], bins=50, ax=axes[1])
axes[1].set_title("Distribution of TotalGHGEmissions")
plt.tight_layout()
plt.show()
The distributions of SiteEnergyUse and TotalGHGEmissions are strongly right-skewed: most buildings have relatively low consumption and emissions, while a small minority are very energy-intensive.
threshold_energy = building_df_cleaning["SiteEnergyUse(kBtu)"].quantile(0.99)
building_df_cleaning = building_df_cleaning[
building_df_cleaning["SiteEnergyUse(kBtu)"] <= threshold_energy
]
building_df_cleaning.shape
(1629, 21)
The target variables are strongly right-skewed and contain many outliers. To limit the influence of extreme values while keeping the bulk of representative buildings, a threshold at the 99th percentile of SiteEnergyUse(kBtu) was applied. This removes only the most extreme observations (17 buildings, from 1646 to 1629 rows), without significantly altering the overall distribution of the data. The filter is applied to energy consumption only; TotalGHGEmissions is not thresholded separately in the code above.
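The percentile filter can be sketched on synthetic data as follows. The `toy` DataFrame, the column name `energy` and the lognormal parameters are assumptions made for illustration; only the row count (1646) mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed consumption values (hypothetical, not the Seattle data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"energy": rng.lognormal(mean=14.0, sigma=1.5, size=1646)})

# 99th percentile, using pandas' default linear interpolation.
threshold = toy["energy"].quantile(0.99)

# Keep the rows at or below the threshold; .copy() avoids chained-assignment
# warnings if columns are added to the filtered frame later.
kept = toy[toy["energy"] <= threshold].copy()
n_removed = len(toy) - len(kept)

print(f"threshold = {threshold:,.0f}, removed = {n_removed}, kept = {len(kept)}")
```

With 1646 distinct values, the interpolated 99th percentile falls between the 1629th and 1630th sorted values, so 17 rows are removed, which matches the count reported in the notebook.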
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.boxplot(x=building_df_cleaning["SiteEnergyUse(kBtu)"], ax=axes[0])
axes[0].set_title("Boxplot SiteEnergyUse (kBtu)")
sns.boxplot(x=building_df_cleaning["TotalGHGEmissions"], ax=axes[1])
axes[1].set_title("Boxplot TotalGHGEmissions")
plt.show()
The boxplots highlight the strong skewness of the target variables, with many points flagged as outliers in the statistical sense.
A logarithmic transformation will be considered during the modeling phase to reduce the impact of this skewness.
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.histplot(
np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
bins=50,
ax=axes[0]
)
axes[0].set_title("Distribution log(SiteEnergyUse + 1)")
sns.histplot(
np.log1p(building_df_cleaning["TotalGHGEmissions"]),
bins=50,
ax=axes[1]
)
axes[1].set_title("Distribution log(TotalGHGEmissions + 1)")
plt.tight_layout()
plt.show()
The target variables were plotted on a logarithmic scale to assess whether a transformation would be useful before modeling.
This anticipates how the skewness observed in the distributions will be handled, without applying any permanent transformation at this stage.
A log(x + 1) transformation yields more symmetric, better-structured distributions, which suggests that modeling on a logarithmic scale is appropriate.
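One way to quantify the visual impression is to compare the sample skewness before and after the transformation. The sketch below uses synthetic lognormal data (an assumption, not the notebook's data) and also shows that `np.expm1` inverts `np.log1p`, which matters if predictions made on the log scale must be converted back to kBtu.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (hypothetical values).
rng = np.random.default_rng(42)
y = pd.Series(rng.lognormal(mean=14.0, sigma=1.2, size=2000))

# Sample skewness: close to 0 for a symmetric distribution, large and positive
# for a long right tail.
skew_raw = y.skew()

# log1p computes log(1 + x), which stays defined when x == 0.
y_log = np.log1p(y)
skew_log = y_log.skew()

# expm1 computes exp(x) - 1 and is the exact inverse of log1p.
y_back = np.expm1(y_log)

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```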
building_df_cleaning[
["OSEBuildingID", "PrimaryPropertyType", "LargestPropertyUseType",
"PropertyGFATotal", "NumberofFloors", "NumberofBuildings","SiteEnergyUse(kBtu)"]
].sort_values(by="SiteEnergyUse(kBtu)", ascending=False).head(15)
| | OSEBuildingID | PrimaryPropertyType | LargestPropertyUseType | PropertyGFATotal | NumberofFloors | NumberofBuildings | SiteEnergyUse(kBtu) |
|---|---|---|---|---|---|---|---|
| 2 | 3 | Hotel | Hotel | 956110 | 41 | 1.0 | 72587024.0 |
| 231 | 355 | Large Office | Office | 617684 | 42 | 1.0 | 69519808.0 |
| 98 | 147 | Hospital | Hospital (General Medical & Surgical) | 285333 | 5 | 4.0 | 68090728.0 |
| 119 | 338 | Other | Other | 299070 | 11 | 1.0 | 65336980.0 |
| 166 | 267 | Hotel | Hotel | 934292 | 0 | 1.0 | 65047284.0 |
| 474 | 637 | Mixed Use Property | Parking | 286000 | 6 | 1.0 | 63835192.0 |
| 187 | 295 | Other | Other | 230880 | 15 | 1.0 | 62197176.0 |
| 329 | 466 | Other | Urgent Care/Clinic/Other Outpatient | 158738 | 8 | 1.0 | 61762380.0 |
| 233 | 357 | Large Office | Office | 1354987 | 63 | 1.0 | 61576184.0 |
| 155 | 245 | Other | Other - Entertainment/Public Assembly | 1585960 | 6 | 1.0 | 59757440.0 |
| 158 | 249 | Other | Other - Entertainment/Public Assembly | 1172127 | 3 | 1.0 | 58761304.0 |
| 3187 | 49732 | Hospital | Hospital (General Medical & Surgical) | 330000 | 8 | 1.0 | 57764408.0 |
| 273 | 402 | Large Office | Office | 1536606 | 46 | 1.0 | 56606136.0 |
| 490 | 659 | Large Office | Office | 1592914 | 42 | 1.0 | 56498868.0 |
| 262 | 389 | Hotel | Hotel | 542305 | 12 | 1.0 | 56485204.0 |
building_df_cleaning[
["OSEBuildingID", "PrimaryPropertyType", "LargestPropertyUseType",
"PropertyGFATotal", "NumberofFloors", "NumberofBuildings","TotalGHGEmissions"]
].sort_values(by="TotalGHGEmissions", ascending=False).head(10)
| | OSEBuildingID | PrimaryPropertyType | LargestPropertyUseType | PropertyGFATotal | NumberofFloors | NumberofBuildings | TotalGHGEmissions |
|---|---|---|---|---|---|---|---|
| 262 | 389 | Hotel | Hotel | 542305 | 12 | 1.0 | 2573.75 |
| 3156 | 45927 | Laboratory | Laboratory | 178000 | 8 | 1.0 | 2549.47 |
| 2 | 3 | Hotel | Hotel | 956110 | 41 | 1.0 | 2089.28 |
| 119 | 338 | Other | Other | 299070 | 11 | 1.0 | 2055.82 |
| 98 | 147 | Hospital | Hospital (General Medical & Surgical) | 285333 | 5 | 4.0 | 1990.50 |
| 3187 | 49732 | Hospital | Hospital (General Medical & Surgical) | 330000 | 8 | 1.0 | 1789.69 |
| 59 | 84 | Senior Care Community | Senior Care Community | 217603 | 5 | 1.0 | 1727.11 |
| 21 | 27 | Other | Other | 385274 | 19 | 1.0 | 1699.45 |
| 166 | 267 | Hotel | Hotel | 934292 | 0 | 1.0 | 1638.46 |
| 373 | 525 | Mixed Use Property | Other - Entertainment/Public Assembly | 154660 | 6 | 1.0 | 1623.34 |
building_df_cleaning[
["OSEBuildingID", "PrimaryPropertyType", "LargestPropertyUseType",
"PropertyGFATotal", "NumberofFloors", "NumberofBuildings","SiteEnergyUse(kBtu)"]
].sort_values(by="SiteEnergyUse(kBtu)", ascending=True).head(17)
| | OSEBuildingID | PrimaryPropertyType | LargestPropertyUseType | PropertyGFATotal | NumberofFloors | NumberofBuildings | SiteEnergyUse(kBtu) |
|---|---|---|---|---|---|---|---|
| 630 | 850 | K-12 School | K-12 School | 55353 | 3 | 1.0 | 0.00000 |
| 746 | 19776 | Other | Other - Education | 29924 | 1 | 1.0 | 0.00000 |
| 62 | 87 | K-12 School | K-12 School | 53352 | 2 | 1.0 | 0.00000 |
| 81 | 118 | K-12 School | K-12 School | 74468 | 3 | 1.0 | 0.00000 |
| 95 | 140 | K-12 School | K-12 School | 66588 | 3 | 1.0 | 0.00000 |
| 1361 | 21616 | K-12 School | K-12 School | 42292 | 1 | 1.0 | 0.00000 |
| 85 | 122 | K-12 School | K-12 School | 58933 | 2 | 1.0 | 0.00000 |
| 1894 | 23722 | K-12 School | K-12 School | 39971 | 1 | 1.0 | 0.00000 |
| 139 | 227 | K-12 School | K-12 School | 136188 | 3 | 1.0 | 0.00000 |
| 31 | 37 | K-12 School | K-12 School | 51582 | 2 | 1.0 | 0.00000 |
| 28 | 34 | K-12 School | K-12 School | 126351 | 1 | 1.0 | 0.00000 |
| 152 | 242 | K-12 School | K-12 School | 52792 | 2 | 1.0 | 0.00000 |
| 3166 | 49703 | K-12 School | K-12 School | 116101 | 1 | 1.0 | 0.00000 |
| 614 | 820 | K-12 School | K-12 School | 52924 | 1 | 1.0 | 0.00000 |
| 133 | 217 | K-12 School | K-12 School | 160270 | 1 | 1.0 | 0.00000 |
| 1577 | 22548 | Self-Storage Facility | Self-Storage Facility | 39952 | 3 | 1.0 | 57133.19922 |
| 3009 | 27869 | Warehouse | Non-Refrigerated Warehouse | 23040 | 2 | 1.0 | 79711.79688 |
building_df_cleaning[
["OSEBuildingID", "PrimaryPropertyType", "LargestPropertyUseType",
"PropertyGFATotal", "NumberofFloors", "NumberofBuildings","TotalGHGEmissions"]
].sort_values(by="TotalGHGEmissions", ascending=True).head(10)
| | OSEBuildingID | PrimaryPropertyType | LargestPropertyUseType | PropertyGFATotal | NumberofFloors | NumberofBuildings | TotalGHGEmissions |
|---|---|---|---|---|---|---|---|
| 513 | 700 | Supermarket / Grocery Store | Supermarket/Grocery Store | 57176 | 1 | 1.0 | 0.00 |
| 1361 | 21616 | K-12 School | K-12 School | 42292 | 1 | 1.0 | 0.00 |
| 746 | 19776 | Other | Other - Education | 29924 | 1 | 1.0 | 0.00 |
| 3206 | 49784 | Small- and Mid-Sized Office | Office | 52000 | 6 | 1.0 | 0.00 |
| 28 | 34 | K-12 School | K-12 School | 126351 | 1 | 1.0 | 0.00 |
| 152 | 242 | K-12 School | K-12 School | 52792 | 2 | 1.0 | 0.00 |
| 1577 | 22548 | Self-Storage Facility | Self-Storage Facility | 39952 | 3 | 1.0 | 0.40 |
| 974 | 20396 | Warehouse | Non-Refrigerated Warehouse | 33300 | 5 | 1.0 | 0.63 |
| 1576 | 22547 | Self-Storage Facility | Self-Storage Facility | 30989 | 3 | 1.0 | 0.68 |
| 2205 | 24778 | Warehouse | Non-Refrigerated Warehouse | 24617 | 1 | 1.0 | 0.75 |
The highest values shown above belong mainly to hotels, hospitals, large offices, laboratories and entertainment venues,
building types known for their high energy intensity.
Conversely, the lowest non-zero values correspond to warehouses and self-storage facilities, which is consistent with their use; the schools reporting exactly zero consumption are examined separately below.
The high-end values are therefore treated as informative observations rather than anomalies.
k12_df = building_df_cleaning[
building_df_cleaning["LargestPropertyUseType"] == "K-12 School"
]
k12_df.shape
(138, 21)
(k12_df["SiteEnergyUse(kBtu)"] == 0).sum()
np.int64(14)
k12_df.sort_values(
by="SiteEnergyUse(kBtu)",
ascending=True
).head(18)
| | OSEBuildingID | BuildingType | PrimaryPropertyType | PropertyName | CouncilDistrictCode | Neighborhood | Latitude | Longitude | YearBuilt | NumberofBuildings | NumberofFloors | PropertyGFATotal | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | TotalGHGEmissions | has_secondary_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 34 | SPS-District K-12 | K-12 School | Meany Building | 3 | Central | 47.62266 | -122.30547 | 1955 | 1.0 | 1 | 126351 | 0 | 126351 | K-12 School | K-12 School | 126351.0 | 73.0 | 0.0000 | 0.00 | 0 |
| 31 | 37 | SPS-District K-12 | K-12 School | John Hay Elementary | 7 | MAGNOLIA / QUEEN ANNE | 47.63290 | -122.35172 | 1989 | 1.0 | 2 | 51582 | 0 | 51582 | K-12 School | K-12 School | 55166.0 | 73.0 | 0.0000 | 10.43 | 0 |
| 81 | 118 | SPS-District K-12 | K-12 School | Pathfinder K-8 | 1 | DELRIDGE | 47.56360 | -122.35800 | 1999 | 1.0 | 3 | 74468 | 0 | 74468 | K-12 School | K-12 School | 75364.0 | 73.0 | 0.0000 | 11.84 | 0 |
| 62 | 87 | SPS-District K-12 | K-12 School | Arbor Heights Elementary | 1 | SOUTHWEST | 47.50970 | -122.37759 | 1948 | 1.0 | 2 | 53352 | 0 | 53352 | K-12 School | K-12 School | 65568.0 | 73.0 | 0.0000 | 4.19 | 0 |
| 85 | 122 | SPS-District K-12 | K-12 School | John Muir Elementary | 2 | SOUTHEAST | 47.57324 | -122.29058 | 1991 | 1.0 | 2 | 58933 | 0 | 58933 | K-12 School | K-12 School | 60725.0 | 73.0 | 0.0000 | 16.36 | 0 |
| 95 | 140 | SPS-District K-12 | K-12 School | B.F. Day Elementary | 6 | LAKE UNION | 47.65464 | -122.34912 | 1991 | 1.0 | 3 | 66588 | 0 | 66588 | K-12 School | K-12 School | 66588.0 | 73.0 | 0.0000 | 14.67 | 0 |
| 133 | 217 | SPS-District K-12 | K-12 School | Whitman Middle | 6 | BALLARD | 47.69675 | -122.37760 | 1959 | 1.0 | 1 | 160270 | 0 | 160270 | K-12 School | K-12 School | 160270.0 | 73.0 | 0.0000 | 229.38 | 0 |
| 139 | 227 | SPS-District K-12 | K-12 School | Washington Middle | 3 | CENTRAL | 47.59796 | -122.30415 | 1963 | 1.0 | 3 | 136188 | 0 | 136188 | K-12 School | K-12 School | 136188.0 | 73.0 | 0.0000 | 170.90 | 0 |
| 630 | 850 | SPS-District K-12 | K-12 School | Leschi Elementary | 3 | CENTRAL | 47.60210 | -122.29181 | 1988 | 1.0 | 3 | 55353 | 0 | 55353 | K-12 School | K-12 School | 55353.0 | 73.0 | 0.0000 | 9.99 | 0 |
| 614 | 820 | SPS-District K-12 | K-12 School | Bailey Gatzert Elementary | 3 | CENTRAL | 47.60120 | -122.31548 | 1988 | 1.0 | 1 | 52924 | 0 | 52924 | K-12 School | K-12 School | 52924.0 | 73.0 | 0.0000 | 13.64 | 0 |
| 152 | 242 | SPS-District K-12 | K-12 School | Olympic View Elementary | 5 | NORTH | 47.69823 | -122.32126 | 1989 | 1.0 | 2 | 52792 | 0 | 52792 | K-12 School | K-12 School | 55480.0 | 73.0 | 0.0000 | 0.00 | 0 |
| 1894 | 23722 | SPS-District K-12 | K-12 School | North Beach Elementary | 6 | BALLARD | 47.69497 | -122.38704 | 1958 | 1.0 | 1 | 39971 | 0 | 39971 | K-12 School | K-12 School | 40867.0 | 73.0 | 0.0000 | 50.22 | 0 |
| 1361 | 21616 | SPS-District K-12 | K-12 School | Olympic Hills Elementary | 5 | NORTH | 47.72369 | -122.30676 | 1954 | 1.0 | 1 | 42292 | 0 | 42292 | K-12 School | K-12 School | 43188.0 | 100.0 | 0.0000 | 0.00 | 0 |
| 3166 | 49703 | SPS-District K-12 | K-12 School | Catharine Blaine K-8 | 7 | MAGNOLIA / QUEEN ANNE | 47.64342 | -122.39970 | 1952 | 1.0 | 1 | 116101 | 0 | 116101 | K-12 School | K-12 School | 119685.0 | 73.0 | 0.0000 | 265.21 | 0 |
| 839 | 19967 | SPS-District K-12 | K-12 School | Queen Anne Gym | 7 | MAGNOLIA / QUEEN ANNE | 47.63203 | -122.35337 | 2001 | 1.0 | 1 | 35805 | 0 | 35805 | K-12 School | K-12 School | 35805.0 | 100.0 | 431471.6875 | 11.54 | 0 |
| 1943 | 23956 | NonResidential | K-12 School | St. Catherine School | 5 | NORTH | 47.69061 | -122.31992 | 1931 | 1.0 | 2 | 23923 | 0 | 23923 | K-12 School | K-12 School | 23923.0 | 69.0 | 805643.5000 | 19.51 | 0 |
| 896 | 20168 | NonResidential | K-12 School | Spruce Street School | 7 | DOWNTOWN | 47.61687 | -122.33533 | 1995 | 1.0 | 3 | 22860 | 0 | 22860 | K-12 School | K-12 School | 22860.0 | 43.0 | 931148.6250 | 6.49 | 0 |
| 1664 | 23040 | NonResidential | K-12 School | Holy Family School | 1 | DELRIDGE | 47.51698 | -122.35946 | 1924 | 1.0 | 2 | 42975 | 0 | 42975 | K-12 School | K-12 School | 21405.0 | 92.0 | 959352.0000 | 42.96 | 0 |
building_df_cleaning.groupby("LargestPropertyUseType")[
"SiteEnergyUse(kBtu)"
].describe().T
| LargestPropertyUseType | Adult Education | Automobile Dealership | Bank Branch | College/University | Courthouse | Data Center | Distribution Center | Financial Office | Fire Station | Fitness Center/Health Club/Gym | Food Service | Hospital (General Medical & Surgical) | Hotel | K-12 School | Laboratory | Library | Lifestyle Center | Manufacturing/Industrial Plant | Medical Office | Movie Theater | Multifamily Housing | Museum | Non-Refrigerated Warehouse | Office | Other | Other - Education | Other - Entertainment/Public Assembly | Other - Lodging/Residential | Other - Mall | Other - Public Services | Other - Recreation | Other - Restaurant/Bar | Other - Services | Other - Utility | Other/Specialty Hospital | Parking | Performing Arts | Personal Services (Health/Beauty, Dry Cleaning, etc) | Police Station | Pre-school/Daycare | Prison/Incarceration | Refrigerated Warehouse | Repair Services (Vehicle, Shoe, Locksmith, etc) | Residence Hall/Dormitory | Residential Care Facility | Restaurant | Retail Store | Self-Storage Facility | Senior Care Community | Social/Meeting Hall | Strip Mall | Supermarket/Grocery Store | Urgent Care/Clinic/Other Outpatient | Wholesale Club/Supercenter | Worship Facility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.000000e+00 | 5.000000e+00 | 4.000000e+00 | 2.100000e+01 | 1.0 | 2.000000e+00 | 5.400000e+01 | 4.000000e+00 | 1.0 | 5.000000e+00 | 1.0 | 4.000000e+00 | 7.400000e+01 | 1.380000e+02 | 1.300000e+01 | 4.000000e+00 | 2.000000e+00 | 8.000000e+00 | 4.000000e+01 | 1.00 | 1.100000e+01 | 5.000000e+00 | 1.990000e+02 | 4.880000e+02 | 9.400000e+01 | 4.000000e+00 | 2.100000e+01 | 5.000000e+00 | 4.000000e+00 | 2.000000e+00 | 3.100000e+01 | 2.000000e+00 | 5.000000e+00 | 2.000000e+00 | 4.000000e+00 | 2.900000e+01 | 3.000000e+00 | 1.0 | 1.0 | 2.000000e+00 | 3.000000 | 1.200000e+01 | 6.000000e+00 | 2.200000e+01 | 1.0 | 1.200000e+01 | 9.800000e+01 | 2.700000e+01 | 2.000000e+01 | 1.000000e+01 | 6.000000e+00 | 4.100000e+01 | 4.000000e+00 | 1.0 | 6.900000e+01 |
| mean | 2.279209e+06 | 1.784132e+06 | 1.378787e+06 | 1.170304e+07 | 44984468.0 | 3.585956e+07 | 2.381789e+06 | 3.669635e+06 | 3095882.0 | 4.875209e+06 | 287763.5 | 4.265062e+07 | 1.125199e+07 | 2.785912e+06 | 2.736661e+07 | 5.625178e+06 | 1.952247e+07 | 3.047635e+06 | 9.523894e+06 | 2786648.75 | 5.095527e+06 | 6.159460e+06 | 1.977226e+06 | 7.605451e+06 | 8.496363e+06 | 3.677596e+06 | 1.269429e+07 | 1.908489e+06 | 1.525270e+07 | 4.033281e+06 | 5.210461e+06 | 6.467534e+06 | 2.297308e+06 | 3.492330e+06 | 7.131417e+06 | 1.343405e+07 | 2.027393e+06 | 2442220.0 | 12086616.0 | 1.852828e+06 | 760047.395833 | 3.719428e+06 | 2.255493e+06 | 3.954950e+06 | 2280352.5 | 6.695283e+06 | 4.668812e+06 | 7.556800e+05 | 1.151729e+07 | 2.550231e+06 | 8.468648e+06 | 9.656153e+06 | 2.527676e+07 | 13424079.0 | 1.202896e+06 |
| std | 1.148594e+06 | 9.776483e+05 | 1.778176e+05 | 1.518279e+07 | NaN | 8.232243e+06 | 3.763009e+06 | 2.806185e+06 | NaN | 3.226966e+06 | NaN | 2.901170e+07 | 1.368966e+07 | 2.602151e+06 | 1.602017e+07 | 8.644412e+06 | 9.103537e+06 | 3.196027e+06 | 8.532517e+06 | NaN | 7.605120e+06 | 5.426606e+06 | 3.590963e+06 | 1.072441e+07 | 1.236213e+07 | 4.978954e+06 | 1.947824e+07 | 9.608536e+05 | 2.103002e+07 | 3.995510e+06 | 5.637523e+06 | 1.882219e+06 | 1.421219e+06 | 2.804439e+06 | 1.999310e+06 | 1.673921e+07 | 1.007660e+06 | NaN | NaN | 4.332340e+05 | 104670.515033 | 5.106499e+06 | 2.763257e+06 | 3.555216e+06 | NaN | 4.234638e+06 | 6.145271e+06 | 4.882451e+05 | 1.126398e+07 | 2.043434e+06 | 7.322327e+06 | 3.700073e+06 | 2.768364e+07 | NaN | 9.651320e+05 |
| min | 1.467030e+06 | 5.441724e+05 | 1.159807e+06 | 3.237394e+05 | 44984468.0 | 3.003849e+07 | 1.501678e+05 | 1.857155e+06 | 3095882.0 | 2.133798e+06 | 287763.5 | 2.037721e+06 | 7.162797e+05 | 0.000000e+00 | 7.251589e+06 | 1.082920e+06 | 1.308530e+07 | 2.222559e+05 | 8.174096e+05 | 2786648.75 | 7.474569e+05 | 8.106369e+05 | 7.971180e+04 | 3.427261e+05 | 1.887457e+05 | 0.000000e+00 | 3.474377e+05 | 1.207504e+06 | 1.247362e+06 | 1.208029e+06 | 4.526125e+05 | 5.136604e+06 | 1.174384e+05 | 1.509292e+06 | 4.262316e+06 | 7.293972e+05 | 8.737111e+05 | 2442220.0 | 12086616.0 | 1.546485e+06 | 645665.375000 | 6.426452e+05 | 2.096449e+05 | 6.051307e+05 | 2280352.5 | 1.358519e+06 | 1.454688e+05 | 5.713320e+04 | 1.076733e+06 | 3.069111e+05 | 1.094941e+06 | 2.671351e+05 | 3.371660e+06 | 13424079.0 | 2.161150e+05 |
| 25% | 1.873119e+06 | 1.310237e+06 | 1.290026e+06 | 2.839882e+06 | 44984468.0 | 3.294903e+07 | 6.787984e+05 | 1.857957e+06 | 3095882.0 | 2.662054e+06 | 287763.5 | 3.254165e+07 | 3.778054e+06 | 1.403497e+06 | 1.159412e+07 | 1.242494e+06 | 1.630388e+07 | 1.464717e+06 | 2.072042e+06 | 2786648.75 | 1.206734e+06 | 2.010455e+06 | 5.990908e+05 | 1.543170e+06 | 1.696297e+06 | 4.503877e+05 | 1.257775e+06 | 1.242666e+06 | 4.128653e+06 | 2.620655e+06 | 1.386394e+06 | 5.802069e+06 | 1.623657e+06 | 2.500811e+06 | 6.514463e+06 | 2.762846e+06 | 1.673507e+06 | 2442220.0 | 12086616.0 | 1.699656e+06 | 714544.187500 | 1.029709e+06 | 4.340821e+05 | 1.604003e+06 | 2280352.5 | 3.671040e+06 | 1.250047e+06 | 4.479211e+05 | 5.034142e+06 | 1.417151e+06 | 2.858959e+06 | 7.619445e+06 | 3.969989e+06 | 13424079.0 | 6.087554e+05 |
| 50% | 2.279209e+06 | 1.922953e+06 | 1.386993e+06 | 4.116872e+06 | 44984468.0 | 3.585956e+07 | 1.145233e+06 | 2.524726e+06 | 3095882.0 | 3.505497e+06 | 287763.5 | 5.023702e+07 | 6.551026e+06 | 1.982446e+06 | 3.326841e+07 | 1.414466e+06 | 1.952247e+07 | 2.231689e+06 | 5.279340e+06 | 2786648.75 | 1.997182e+06 | 5.903015e+06 | 1.262235e+06 | 3.429916e+06 | 3.629290e+06 | 1.938148e+06 | 3.537842e+06 | 1.601956e+06 | 6.626248e+06 | 4.033281e+06 | 2.925780e+06 | 6.467534e+06 | 3.076813e+06 | 3.492330e+06 | 7.870680e+06 | 7.273156e+06 | 2.473302e+06 | 2442220.0 | 12086616.0 | 1.852828e+06 | 783423.000000 | 1.481031e+06 | 1.267959e+06 | 2.800176e+06 | 2280352.5 | 5.777507e+06 | 2.297724e+06 | 7.532752e+05 | 7.366733e+06 | 2.095775e+06 | 7.632631e+06 | 9.232576e+06 | 1.798650e+07 | 13424079.0 | 8.930532e+05 |
| 75% | 2.685298e+06 | 1.938613e+06 | 1.475754e+06 | 1.218412e+07 | 44984468.0 | 3.877010e+07 | 2.472219e+06 | 4.336405e+06 | 3095882.0 | 6.118300e+06 | 287763.5 | 6.034599e+07 | 1.174479e+07 | 3.178583e+06 | 3.936412e+07 | 5.797149e+06 | 2.274106e+07 | 2.870089e+06 | 1.590601e+07 | 2786648.75 | 3.985863e+06 | 7.548807e+06 | 2.226575e+06 | 8.479784e+06 | 8.065290e+06 | 5.165356e+06 | 1.483348e+07 | 1.950092e+06 | 1.775030e+07 | 5.445907e+06 | 5.944156e+06 | 7.132999e+06 | 3.094090e+06 | 4.483848e+06 | 8.487634e+06 | 1.373020e+07 | 2.604234e+06 | 2442220.0 | 12086616.0 | 2.005999e+06 | 817238.406250 | 3.149600e+06 | 2.683760e+06 | 5.411051e+06 | 2280352.5 | 9.583717e+06 | 5.449356e+06 | 9.708208e+05 | 1.321218e+07 | 2.919119e+06 | 1.125162e+07 | 1.252517e+07 | 3.929327e+07 | 13424079.0 | 1.544996e+06 |
| max | 3.091388e+06 | 3.204686e+06 | 1.581353e+06 | 5.116831e+07 | 44984468.0 | 4.168064e+07 | 2.179583e+07 | 7.771934e+06 | 3095882.0 | 9.956396e+06 | 287763.5 | 6.809073e+07 | 7.258702e+07 | 1.356777e+07 | 5.316616e+07 | 1.858886e+07 | 2.595964e+07 | 1.023472e+07 | 2.873150e+07 | 2786648.75 | 2.641677e+07 | 1.452439e+07 | 4.473116e+07 | 6.951981e+07 | 6.533698e+07 | 1.083409e+07 | 5.975744e+07 | 3.540226e+06 | 4.651096e+07 | 6.858534e+06 | 2.276783e+07 | 7.798464e+06 | 3.574542e+06 | 5.475367e+06 | 8.521989e+06 | 6.383519e+07 | 2.735166e+06 | 2442220.0 | 12086616.0 | 2.159170e+06 | 851053.812500 | 1.769542e+07 | 7.475578e+06 | 1.596586e+07 | 2280352.5 | 1.548068e+07 | 4.006289e+07 | 2.158629e+06 | 4.279207e+07 | 7.606084e+06 | 2.072600e+07 | 1.686598e+07 | 6.176238e+07 | 13424079.0 | 5.587347e+06 |
building_df_cleaning = building_df_cleaning[
~(
(building_df_cleaning["LargestPropertyUseType"] == "K-12 School") &
(building_df_cleaning["SiteEnergyUse(kBtu)"] == 0)
)
]
building_df_cleaning = building_df_cleaning[
building_df_cleaning["OSEBuildingID"] != 19776
]
building_df_cleaning.shape
(1614, 21)
(building_df_cleaning["SiteEnergyUse(kBtu)"] == 0).sum()
np.int64(0)
Of the 138 K-12 buildings, 14 report zero energy consumption (SiteEnergyUse = 0).
This number is very small, relative both to the K-12 buildings and to the dataset as a whole.
These buildings stand apart from the other K-12 schools and from every other building type: the K-12 median is about 1.98e6 kBtu even with these rows included, whereas these 14 observations are exactly zero.
In addition, several of them report non-zero GHG emissions, which is physically inconsistent: a building cannot emit greenhouse gases from its energy use without consuming any energy.
These observations are therefore considered unreliable. Because SiteEnergyUse is a target variable, no meaningful imputation is possible.
To avoid introducing bias into the modeling, these 14 buildings were removed from the dataset.
This removal is justified by:
their small share of the dataset,
their inconsistency from a domain standpoint,
and the risk of significant bias they could introduce during modeling.
Finally, one last building ("Other - Education", OSEBuildingID 19776) reports zero energy consumption. It is an isolated case in the dataset and was also removed.
At this stage, the dataset contains 1614 buildings described by 21 variables.
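The consistency rule used above (zero energy but non-zero emissions) can be expressed as a boolean mask. The sketch below uses four rows copied from the tables shown earlier; unlike the notebook, which removes the K-12 rows and building 19776 in two steps, it drops every zero-energy row in one pass, which is a simplification.

```python
import pandas as pd

# Four rows taken from the notebook outputs above (IDs 34, 37, 19967, 22548).
toy = pd.DataFrame({
    "OSEBuildingID": [34, 37, 19967, 22548],
    "SiteEnergyUse(kBtu)": [0.0, 0.0, 431471.6875, 57133.19922],
    "TotalGHGEmissions": [0.0, 10.43, 11.54, 0.40],
})

# Rows reporting no energy use at all.
zero_energy = toy["SiteEnergyUse(kBtu)"] == 0

# Physically inconsistent rows: emissions reported without any energy use.
inconsistent = zero_energy & (toy["TotalGHGEmissions"] > 0)

# Simplified cleaning step: drop every zero-energy row.
cleaned = toy[~zero_energy].copy()

print(f"zero energy: {int(zero_energy.sum())}, inconsistent: {int(inconsistent.sum())}")
print(cleaned["OSEBuildingID"].tolist())
```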
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.histplot(
np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
bins=50,
ax=axes[0]
)
axes[0].set_title("Distribution log(SiteEnergyUse + 1)")
sns.histplot(
np.log1p(building_df_cleaning["TotalGHGEmissions"]),
bins=50,
ax=axes[1]
)
axes[1].set_title("Distribution log(TotalGHGEmissions + 1)")
plt.tight_layout()
plt.show()
Plotting on a logarithmic scale shows markedly more balanced distributions for the target variables.
Bivariate EDA: Targets <-> Numerical Explanatory Variables¶
The goal is to understand what drives energy consumption and CO2 emissions.
-> Focus on PropertyGFA (Total/Building/Parking)
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.scatterplot(
x=building_df_cleaning["PropertyGFATotal"],
y=building_df_cleaning["SiteEnergyUse(kBtu)"],
ax=axes[0]
)
axes[0].set_title("SiteEnergyUse vs floor area (GFA)")
sns.scatterplot(
x=building_df_cleaning["PropertyGFATotal"],
y=building_df_cleaning["TotalGHGEmissions"],
ax=axes[1]
)
axes[1].set_title("TotalGHGEmissions vs floor area (GFA)")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFATotal"]),
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0]
)
axes[0].set_title("log(SiteEnergyUse) vs log(floor area)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFATotal"]),
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1]
)
axes[1].set_title("log(TotalGHGEmissions) vs log(floor area)")
plt.show()
The larger a building is, the more energy it consumes and the more CO₂ it emits, which is consistent from a domain standpoint.
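The relationship read off the scatter plots can also be summarized numerically, for instance with a Pearson correlation and a fitted slope on the log-log scale. The sketch below uses synthetic data generated under an assumed power law (energy roughly proportional to area^0.95 with multiplicative noise); the names `area` and `energy` and all parameters are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic floor areas and energy use following an assumed power law.
rng = np.random.default_rng(1)
area = rng.lognormal(mean=11.0, sigma=0.9, size=500)
energy = 50.0 * area ** 0.95 * rng.lognormal(mean=0.0, sigma=0.6, size=500)
toy = pd.DataFrame({"area": area, "energy": energy})

log_area = np.log1p(toy["area"])
log_energy = np.log1p(toy["energy"])

# Strength of the linear relationship on the log-log scale.
pearson_log = log_area.corr(log_energy)

# Slope of the log-log fit: a value near 1 means energy grows roughly
# proportionally with floor area.
slope, intercept = np.polyfit(log_area, log_energy, deg=1)

print(f"Pearson r (log-log): {pearson_log:.2f}, slope: {slope:.2f}")
```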
fig, axes = plt.subplots(1, 3, figsize=(18,5))
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFATotal"]),
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0]
)
axes[0].set_title("log(SiteEnergyUse) vs log(total floor area)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFABuilding(s)"]),
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[1]
)
axes[1].set_title("log(SiteEnergyUse) vs log(building floor area)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFAParking"]),
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[2]
)
axes[2].set_title("log(SiteEnergyUse) vs log(parking floor area)")
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(1, 3, figsize=(18,5))
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFATotal"]),
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[0]
)
axes[0].set_title("log(GHG) vs log(total floor area)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFABuilding(s)"]),
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1]
)
axes[1].set_title("log(GHG) vs log(building floor area)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["PropertyGFAParking"]),
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[2]
)
axes[2].set_title("log(GHG) vs log(parking floor area)")
plt.tight_layout()
plt.show()
cols_for_corr = [
"PropertyGFATotal",
"PropertyGFABuilding(s)",
"PropertyGFAParking",
"SiteEnergyUse(kBtu)",
"TotalGHGEmissions"
]
plt.figure(figsize=(6,4))
sns.heatmap(
building_df_cleaning[cols_for_corr].apply(np.log1p).corr(),
annot=True,
cmap="coolwarm",
fmt=".2f"
)
plt.title("Correlation (log1p): floor areas and targets")
plt.show()
# Drop the total floor area (redundant)
building_df_cleaning = building_df_cleaning.drop(columns=["PropertyGFATotal"], errors="ignore")
building_df_cleaning.shape
(1614, 20)
Because the total floor area is strongly correlated with the building floor area, it is dropped to avoid redundancy.
The building floor area and the parking floor area are kept as separate features so that the model can learn distinct effects:
the building floor area is directly linked to energy consumption and CO2 emissions, whereas the parking area may have a variable impact depending on how it is used.
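The redundancy argument can be illustrated on synthetic data. The sketch assumes that the total floor area is the sum of the building and parking areas; this identity holds for the rows visible in this notebook (for example 759392 + 196718 = 956110 for OSEBuildingID 3) but is an assumption for the rest. Column names and distribution parameters are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic floor areas (hypothetical values, not the Seattle data).
rng = np.random.default_rng(7)
n = 300
building = rng.lognormal(mean=11.0, sigma=1.0, size=n)
has_parking = rng.random(n) < 0.3  # most properties have no parking area
parking = np.where(has_parking, rng.lognormal(mean=10.0, sigma=1.0, size=n), 0.0)

toy = pd.DataFrame({
    "gfa_building": building,
    "gfa_parking": parking,
    "gfa_total": building + parking,  # assumed identity: total = building + parking
})

# Correlation matrix on the log1p scale, as in the heatmap above.
log_corr = toy.apply(np.log1p).corr()
corr_total_building = log_corr.loc["gfa_total", "gfa_building"]

# The total carries almost no information beyond its two components,
# so it is the column dropped.
reduced = toy.drop(columns=["gfa_total"])

print(f"corr(log total, log building) = {corr_total_building:.2f}")
print(list(reduced.columns))
```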
-> Focus on NumberofBuildings and NumberofFloors
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.scatterplot(
x=building_df_cleaning["NumberofFloors"],
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0]
)
axes[0].set_title("log(SiteEnergyUse) vs Number of Floors")
sns.scatterplot(
x=building_df_cleaning["NumberofBuildings"],
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[1]
)
axes[1].set_title("log(SiteEnergyUse) vs Number of Buildings")
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.scatterplot(
x=building_df_cleaning["NumberofFloors"],
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[0]
)
axes[0].set_title("log(TotalGHGEmissions) vs Number of Floors")
sns.scatterplot(
x=building_df_cleaning["NumberofBuildings"],
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1]
)
axes[1].set_title("log(TotalGHGEmissions) vs Number of Buildings")
plt.tight_layout()
plt.show()
building_df_cleaning[
["OSEBuildingID", "PrimaryPropertyType", "LargestPropertyUseType",
"PropertyGFABuilding(s)", "NumberofFloors", "NumberofBuildings","TotalGHGEmissions","SiteEnergyUse(kBtu)"]
].sort_values(by="NumberofFloors", ascending=False).head(5)
| | OSEBuildingID | PrimaryPropertyType | LargestPropertyUseType | PropertyGFABuilding(s) | NumberofFloors | NumberofBuildings | TotalGHGEmissions | SiteEnergyUse(kBtu) |
|---|---|---|---|---|---|---|---|---|
| 1359 | 21611 | Worship Facility | Worship Facility | 21948 | 99 | 1.0 | 2.27 | 3.260012e+05 |
| 233 | 357 | Large Office | Office | 1195387 | 63 | 1.0 | 429.27 | 6.157618e+07 |
| 292 | 422 | Large Office | Office | 1215718 | 56 | 1.0 | 525.78 | 4.951770e+07 |
| 271 | 399 | Large Office | Office | 1115000 | 55 | 1.0 | 588.90 | 5.307916e+07 |
| 229 | 353 | Large Office | Office | 754455 | 49 | 1.0 | 627.87 | 4.516331e+07 |
cols_corr = [
"NumberofFloors",
"NumberofBuildings",
"SiteEnergyUse(kBtu)",
"TotalGHGEmissions",
"PropertyGFABuilding(s)",
"PropertyGFAParking"
]
corr_matrix = building_df_cleaning[cols_corr].apply(np.log1p).corr()
plt.figure(figsize=(6,5))
sns.heatmap(
corr_matrix,
annot=True,
cmap="coolwarm",
fmt=".2f"
)
plt.title("Correlations (log1p): floors / buildings / targets / floor area")
plt.show()
building_df_cleaning = building_df_cleaning.drop(columns=["NumberofFloors"])
building_df_cleaning.shape
(1614, 19)
The floor areas (PropertyGFABuilding(s) and PropertyGFAParking) are direct physical measurements, more stable and more explanatory of energy consumption and emissions than the number of floors, which mainly acts as a proxy variable.
In addition, inconsistencies were observed in some extreme values of the number of floors: for example, a worship facility (OSEBuildingID 21611) is recorded with 99 floors for 21,948 sq ft of building area.
To limit redundancy and make the model more robust, NumberofFloors is dropped in favor of the built floor areas.
At this stage, the dataset contains 1614 observations and 19 columns.
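A simple way to surface implausible floor counts is to look at the building area per floor. The sketch below reuses the five tallest buildings listed in the output above; the 1,000 sq ft cut-off is a hypothetical threshold chosen for illustration, not a value from the notebook.

```python
import pandas as pd

# Five tallest buildings as listed in the notebook output above.
tallest = pd.DataFrame({
    "OSEBuildingID": [21611, 357, 422, 399, 353],
    "NumberofFloors": [99, 63, 56, 55, 49],
    "PropertyGFABuilding(s)": [21948, 1195387, 1215718, 1115000, 754455],
})

# Average building area per floor, in sq ft.
tallest["area_per_floor"] = (
    tallest["PropertyGFABuilding(s)"] / tallest["NumberofFloors"]
)

MIN_AREA_PER_FLOOR = 1000  # hypothetical plausibility threshold (sq ft)
suspicious = tallest[tallest["area_per_floor"] < MIN_AREA_PER_FLOOR]
suspicious_ids = suspicious["OSEBuildingID"].tolist()

print(suspicious[["OSEBuildingID", "NumberofFloors", "area_per_floor"]])
```

Only the worship facility (about 222 sq ft per floor) falls below the threshold; the large offices sit between roughly 15,000 and 22,000 sq ft per floor. Rows with a floor count of 0 would need a separate check, since the division is undefined for them.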
-> Focus on YearBuilt
CURRENT_YEAR = 2016 # reporting year of the dataset
building_df_cleaning["BuildingAge"] = (
CURRENT_YEAR - building_df_cleaning["YearBuilt"]
)
building_df_cleaning = building_df_cleaning[
building_df_cleaning["BuildingAge"] >= 0
]
building_df_cleaning = building_df_cleaning.drop(columns=["YearBuilt"])
building_df_cleaning.shape
(1614, 19)
building_df_cleaning
| | OSEBuildingID | BuildingType | PrimaryPropertyType | PropertyName | CouncilDistrictCode | Neighborhood | Latitude | Longitude | NumberofBuildings | PropertyGFAParking | PropertyGFABuilding(s) | ListOfAllPropertyUseTypes | LargestPropertyUseType | LargestPropertyUseTypeGFA | ENERGYSTARScore | SiteEnergyUse(kBtu) | TotalGHGEmissions | has_secondary_use | BuildingAge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NonResidential | Hotel | Mayflower park hotel | 7 | DOWNTOWN | 47.61220 | -122.33799 | 1.0 | 0 | 88434 | Hotel | Hotel | 88434.0 | 60.0 | 7.226362e+06 | 249.98 | 0 | 89 |
| 1 | 2 | NonResidential | Hotel | Paramount Hotel | 7 | DOWNTOWN | 47.61317 | -122.33393 | 1.0 | 15064 | 88502 | Hotel, Parking, Restaurant | Hotel | 83880.0 | 61.0 | 8.387933e+06 | 295.86 | 1 | 20 |
| 2 | 3 | NonResidential | Hotel | 5673-The Westin Seattle | 7 | DOWNTOWN | 47.61393 | -122.33810 | 1.0 | 196718 | 759392 | Hotel | Hotel | 756493.0 | 43.0 | 7.258702e+07 | 2089.28 | 0 | 47 |
| 3 | 5 | NonResidential | Hotel | HOTEL MAX | 7 | DOWNTOWN | 47.61412 | -122.33664 | 1.0 | 0 | 61320 | Hotel | Hotel | 61320.0 | 56.0 | 6.794584e+06 | 286.43 | 0 | 90 |
| 4 | 8 | NonResidential | Hotel | WARWICK SEATTLE HOTEL (ID8) | 7 | DOWNTOWN | 47.61375 | -122.34047 | 1.0 | 62000 | 113580 | Hotel, Parking, Swimming Pool | Hotel | 123445.0 | 75.0 | 1.417261e+07 | 505.01 | 1 | 36 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3371 | 50222 | Nonresidential COS | Office | Horticulture building | 2 | GREATER DUWAMISH | 47.56722 | -122.31154 | 1.0 | 0 | 12294 | Office | Office | 12294.0 | 46.0 | 8.497457e+05 | 20.94 | 0 | 26 |
| 3372 | 50223 | Nonresidential COS | Other | International district/Chinatown CC | 2 | DOWNTOWN | 47.59625 | -122.32283 | 1.0 | 0 | 16000 | Other - Recreation | Other - Recreation | 16000.0 | 73.0 | 9.502762e+05 | 32.17 | 0 | 12 |
| 3373 | 50224 | Nonresidential COS | Other | Queen Anne Pool | 7 | MAGNOLIA / QUEEN ANNE | 47.63644 | -122.35784 | 1.0 | 0 | 13157 | Fitness Center/Health Club/Gym, Other - Recrea... | Other - Recreation | 7583.0 | 73.0 | 5.765898e+06 | 223.54 | 1 | 42 |
| 3374 | 50225 | Nonresidential COS | Mixed Use Property | South Park Community Center | 1 | GREATER DUWAMISH | 47.52832 | -122.32431 | 1.0 | 0 | 14101 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 6601.0 | 73.0 | 7.194712e+05 | 22.11 | 1 | 27 |
| 3375 | 50226 | Nonresidential COS | Mixed Use Property | Van Asselt Community Center | 2 | GREATER DUWAMISH | 47.53939 | -122.29536 | 1.0 | 0 | 18258 | Fitness Center/Health Club/Gym, Food Service, ... | Other - Recreation | 8271.0 | 73.0 | 1.152896e+06 | 41.27 | 1 | 78 |
1614 rows × 19 columns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.scatterplot(
x=building_df_cleaning["BuildingAge"],
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0]
)
axes[0].set_title("log(SiteEnergyUse) vs Building Age")
sns.scatterplot(
x=building_df_cleaning["BuildingAge"],
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1]
)
axes[1].set_title("log(TotalGHGEmissions) vs Building Age")
plt.tight_layout()
plt.show()
To analyze the influence of building age, we created a BuildingAge variable, computed as the dataset year (2016) minus the year of construction.
The scatter remains wide across all ages, indicating that age alone does not directly explain consumption or emission levels.
Nevertheless, this variable remains relevant from a domain perspective (building codes, insulation, available technologies) and could interact with other building characteristics.
It is therefore kept at this stage for the modeling phase.
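One way to look past the scatter is to summarize the target by age band. The sketch below uses a handful of made-up rows; the band edges and the `AgeBand` and `log_energy` column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for building_df_cleaning.
toy = pd.DataFrame({
    "BuildingAge": [3, 12, 18, 27, 36, 47, 58, 78, 89, 101],
    "SiteEnergyUse(kBtu)": [2.1e6, 9.5e5, 7.3e7, 8.5e5, 1.4e7,
                            7.2e7, 3.3e6, 1.2e6, 7.2e6, 4.0e6],
})

# Left-closed bands: [0, 25), [25, 50), [50, 75), [75, inf).
bins = [0, 25, 50, 75, np.inf]
labels = ["0-24", "25-49", "50-74", "75+"]
toy["AgeBand"] = pd.cut(toy["BuildingAge"], bins=bins, labels=labels, right=False)
toy["log_energy"] = np.log1p(toy["SiteEnergyUse(kBtu)"])

# Count and median of the log target per band.
summary = toy.groupby("AgeBand", observed=True)["log_energy"].agg(["count", "median"])
print(summary)
```

If the medians barely move from one band to the next while the counts stay reasonable, that would support the reading that age alone carries little signal.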
-> Focus on ENERGYSTARScore
plt.figure(figsize=(6,4))
sns.histplot(building_df_cleaning["ENERGYSTARScore"], bins=30, kde=True)
plt.title("Distribution of ENERGYSTARScore")
plt.xlabel("ENERGYSTARScore")
plt.ylabel("Count")
plt.show()
The documentation on the official website states that "A score of 50 represents the national median."
The exploratory analysis shows that most Seattle buildings have a score above 50, suggesting overall energy performance better than the national median.
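The "majority above 50" statement can be quantified directly rather than read off the histogram. A minimal sketch, using a few illustrative scores (including one missing value) rather than the full column:

```python
import pandas as pd

# Illustrative scores only; on the real data use building_df_cleaning["ENERGYSTARScore"].
scores = pd.Series([60.0, 61.0, 43.0, 56.0, 75.0, None, 46.0, 73.0],
                   name="ENERGYSTARScore")

valid = scores.dropna()                # buildings without a score are excluded
share_above_50 = (valid > 50).mean()   # fraction of scored buildings above 50

print(f"{len(valid)} scored buildings, "
      f"median = {valid.median():.1f}, share > 50 = {share_above_50:.1%}")
```

Note that the share is computed among scored buildings only, so it says nothing about buildings with a missing score.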
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.scatterplot(
x=building_df_cleaning["ENERGYSTARScore"],
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0]
)
axes[0].set_title("log(SiteEnergyUse) vs ENERGYSTARScore")
sns.scatterplot(
x=building_df_cleaning["ENERGYSTARScore"],
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1]
)
axes[1].set_title("log(TotalGHGEmissions) vs ENERGYSTARScore")
plt.tight_layout()
plt.show()
building_df_cleaning[
["ENERGYSTARScore", "SiteEnergyUse(kBtu)", "TotalGHGEmissions"]
].corr()
| | ENERGYSTARScore | SiteEnergyUse(kBtu) | TotalGHGEmissions |
|---|---|---|---|
| ENERGYSTARScore | 1.000000 | -0.044202 | -0.083981 |
| SiteEnergyUse(kBtu) | -0.044202 | 1.000000 | 0.810560 |
| TotalGHGEmissions | -0.083981 | 0.810560 | 1.000000 |
building_df_cleaning = building_df_cleaning.drop(columns=["ENERGYSTARScore"])
building_df_cleaning.shape
(1614, 18)
The analysis shows a very weak linear correlation between ENERGYSTARScore and the target variables (Pearson r of about -0.04 with energy use and -0.08 with CO2 emissions, on the raw scale), which is confirmed by widely scattered point clouds with no identifiable trend.
In addition, the distribution of ENERGYSTARScore is strongly concentrated around certain values. (A variable that often takes the same value brings little information to the model.)
Finally, since this score is built from information related to energy consumption, using it as an explanatory variable could introduce target leakage.
Consequently, ENERGYSTARScore is not retained for the modeling phase.
At this stage, the dataset contains 1614 buildings described by 18 variables.
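Two of the arguments above (concentration of values, weak association) can be checked numerically. The sketch below works on made-up rows; the rank-based coefficient is a Spearman correlation computed as a Pearson correlation on ranks, added here as a complement to the raw-scale Pearson value and not taken from the original analysis.

```python
import pandas as pd

# Made-up rows standing in for building_df_cleaning.
toy = pd.DataFrame({
    "ENERGYSTARScore": [60, 61, 43, 56, 75, 46, 73, 73, 73, 73],
    "SiteEnergyUse(kBtu)": [7.2e6, 8.4e6, 7.3e7, 6.8e6, 1.4e7,
                            8.5e5, 9.5e5, 5.8e6, 7.2e5, 1.2e6],
})

# Share of rows taken by the single most frequent score.
top_share = toy["ENERGYSTARScore"].value_counts(normalize=True).iloc[0]

# Linear association on the raw scale.
pearson = toy["ENERGYSTARScore"].corr(toy["SiteEnergyUse(kBtu)"])

# Monotonic association: Pearson correlation of the ranks (Spearman).
spearman = toy["ENERGYSTARScore"].rank().corr(toy["SiteEnergyUse(kBtu)"].rank())

print(f"top value share = {top_share:.0%}, "
      f"pearson = {pearson:.2f}, spearman = {spearman:.2f}")
```

A rank-based coefficient is less sensitive to the very large sites that dominate the raw-scale correlation.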
-> Focus on LargestPropertyUseTypeGFA
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.scatterplot(
x=np.log1p(building_df_cleaning["LargestPropertyUseTypeGFA"]),
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
ax=axes[0],
alpha=0.4
)
axes[0].set_title("log(SiteEnergyUse) vs log(LargestPropertyUseTypeGFA)")
axes[0].set_xlabel("log(LargestPropertyUseTypeGFA)")
sns.scatterplot(
x=np.log1p(building_df_cleaning["LargestPropertyUseTypeGFA"]),
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
ax=axes[1],
alpha=0.4
)
axes[1].set_title("log(TotalGHGEmissions) vs log(LargestPropertyUseTypeGFA)")
axes[1].set_xlabel("log(LargestPropertyUseTypeGFA)")
plt.tight_layout()
plt.show()
plt.figure(figsize=(6,4))
sns.heatmap(
building_df_cleaning[["LargestPropertyUseTypeGFA", "SiteEnergyUse(kBtu)", "TotalGHGEmissions"]].corr(),
annot=True,
cmap="coolwarm",
fmt=".2f"
)
plt.title("Correlation between LargestPropertyUseTypeGFA and the targets")
plt.show()
Bivariate EDA: targets <-> categorical explanatory variables¶
-> Focus on property uses
order_energy = (
building_df_cleaning
.groupby("LargestPropertyUseType")["SiteEnergyUse(kBtu)"]
.median()
.sort_values()
.index
)
plt.figure(figsize=(18, 6))
sns.boxplot(
data=building_df_cleaning,
x="LargestPropertyUseType",
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
order=order_energy
)
plt.xticks(rotation=45, ha="right")
plt.ylabel("log(SiteEnergyUse)")
plt.xlabel("LargestPropertyUseType")
plt.title("Distribution of log(SiteEnergyUse) by primary use")
plt.tight_layout()
plt.show()
order_ghg = (
building_df_cleaning
.groupby("LargestPropertyUseType")["TotalGHGEmissions"]
.median()
.sort_values()
.index
)
plt.figure(figsize=(18, 6))
sns.boxplot(
data=building_df_cleaning,
x="LargestPropertyUseType",
y=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
order=order_ghg
)
plt.xticks(rotation=45, ha="right")
plt.ylabel("log(TotalGHGEmissions)")
plt.xlabel("LargestPropertyUseType")
plt.title("Distribution of log(TotalGHGEmissions) by primary use")
plt.tight_layout()
plt.show()
The boxplots show that a building's primary use influences the distribution of energy consumption and CO2 emissions.
Some uses (e.g. hotels, supermarkets, medical offices) show higher median levels, while others are lower and more homogeneous.
However, within-use dispersion remains large, suggesting that use alone does not fully explain consumption and must be combined with floor-area and structural variables.
usage_energy = (
building_df_cleaning
.groupby("LargestPropertyUseType")["SiteEnergyUse(kBtu)"]
.mean()
.sort_values(ascending=False)
)
plt.figure(figsize=(12, 5))
usage_energy.plot(kind="bar")
plt.title("Mean energy consumption by use")
plt.ylabel("SiteEnergyUse (kBtu)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
usage_ghg = (
building_df_cleaning
.groupby("LargestPropertyUseType")["TotalGHGEmissions"]
.mean()
.sort_values(ascending=False)
)
plt.figure(figsize=(12, 5))
usage_ghg.plot(kind="bar")
plt.title("Mean GHG emissions by use")
plt.ylabel("TotalGHGEmissions")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
Some uses, such as hospitals, universities or laboratories, appear particularly energy-intensive on average, which is consistent with their operational constraints (continuous operation, specialized equipment).
Other uses appear far less energy-intensive on average.
Note: these averages are strongly influenced by building size, the presence of very large sites, and the number of observations per use.
The primary use type (LargestPropertyUseType) therefore provides structuring information on the levels of energy consumption and GHG emissions, but its explanatory power remains partial.
Consequently, this variable is kept for the modeling phase, alongside the quantitative variables related to floor area and building characteristics.
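The caveat about averages can be made concrete by reporting, for each use, the number of buildings, the mean, the median, and a size-normalized intensity. The rows below are made up, and the `kBtu_per_sf` column (energy divided by LargestPropertyUseTypeGFA) is an assumption introduced for illustration, not a variable of the original notebook.

```python
import pandas as pd

# Made-up rows standing in for building_df_cleaning.
toy = pd.DataFrame({
    "LargestPropertyUseType": ["Hotel", "Hotel", "Office", "Office", "Office", "Hospital"],
    "LargestPropertyUseTypeGFA": [88434.0, 83880.0, 12294.0, 754455.0, 100000.0, 500000.0],
    "SiteEnergyUse(kBtu)": [7.2e6, 8.4e6, 8.5e5, 4.5e7, 5.0e6, 1.0e8],
})

# Hypothetical intensity: energy per square foot of the largest use.
toy["kBtu_per_sf"] = toy["SiteEnergyUse(kBtu)"] / toy["LargestPropertyUseTypeGFA"]

summary = (
    toy.groupby("LargestPropertyUseType")
    .agg(
        n_buildings=("SiteEnergyUse(kBtu)", "size"),
        mean_energy=("SiteEnergyUse(kBtu)", "mean"),
        median_energy=("SiteEnergyUse(kBtu)", "median"),
        median_intensity=("kBtu_per_sf", "median"),
    )
    .sort_values("median_intensity", ascending=False)
)
print(summary)
```

In this toy example the Office mean sits well above the Office median because one very large site pulls it up, and the Hospital figure rests on a single building, which is exactly the kind of effect the note warns about.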
usage_counts = building_df_cleaning["has_secondary_use"].value_counts()
labels = ["Single-use", "Mixed-use"]
sizes = [
usage_counts.get(0, 0),
usage_counts.get(1, 0)
]
plt.figure(figsize=(6, 6))
plt.pie(
sizes,
labels=labels,
autopct="%1.1f%%",
startangle=90
)
plt.title("Share of buildings: single-use vs mixed-use")
plt.axis("equal")
plt.show()
The split between single-use and mixed-use buildings is fairly balanced, with a slight majority of mixed-use buildings (about 52%) versus 48% single-use buildings.
This shows that having several uses within the same building is a common feature of the building stock under study.
plt.figure(figsize=(6, 5))
sns.boxplot(
data=building_df_cleaning,
x="has_secondary_use",
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"])
)
plt.xticks([0, 1], ["Single-use", "Mixed-use"])
plt.xlabel("Use type")
plt.ylabel("log(SiteEnergyUse)")
plt.title("Effect of single-use vs mixed-use on energy consumption")
plt.tight_layout()
plt.show()
plt.figure(figsize=(6, 5))
sns.boxplot(
data=building_df_cleaning,
x="has_secondary_use",
y=np.log1p(building_df_cleaning["TotalGHGEmissions"])
)
plt.xticks([0, 1], ["Single-use", "Mixed-use"])
plt.xlabel("Use type")
plt.ylabel("log(TotalGHGEmissions)")
plt.title("Effect of single-use vs mixed-use on GHG emissions")
plt.tight_layout()
plt.show()
The boxplots show that mixed-use buildings have, in median, slightly higher energy consumption (SiteEnergyUse) and GHG emissions than single-use buildings.
Dispersion is also larger for mixed-use buildings, which suggests greater heterogeneity of energy behavior in this category.
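The visual reading of the boxplots (higher median, wider spread) can be backed by a per-group median and interquartile range on the log scale. A minimal sketch on made-up values; the `iqr` helper is defined here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for building_df_cleaning.
toy = pd.DataFrame({
    "has_secondary_use": [0, 0, 0, 0, 1, 1, 1, 1],
    "SiteEnergyUse(kBtu)": [1e6, 2e6, 3e6, 4e6, 1e6, 4e6, 8e6, 3e7],
})
toy["log_energy"] = np.log1p(toy["SiteEnergyUse(kBtu)"])


def iqr(s):
    """Interquartile range: the height of the box in a boxplot."""
    return s.quantile(0.75) - s.quantile(0.25)


summary = toy.groupby("has_secondary_use")["log_energy"].agg(median="median", iqr=iqr)
print(summary)
```

The same call on `TotalGHGEmissions` would give the matching figures for emissions.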
building_df_cleaning = building_df_cleaning.drop(columns=["PrimaryPropertyType","ListOfAllPropertyUseTypes"])
building_df_cleaning.shape
(1614, 16)
- An exploratory analysis was also carried out on the PrimaryPropertyType variable.
However, this variable relies on a more aggregated categorization of uses, which loses information compared with LargestPropertyUseType, particularly for specific energy-intensive uses (for example data centers or laboratories).
By comparison, LargestPropertyUseType offers finer granularity and a better ability to separate distinct energy profiles.
- The ListOfAllPropertyUseTypes variable contains a free-text, multi-valued description of a building's uses.
Although informative from a descriptive standpoint, it is difficult to use directly in a modeling setting.
These two variables were therefore dropped in favor of more structured variables.
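For reference, this is roughly what exploiting the multi-valued text would involve. The sketch assumes that uses are separated by ", " and that no use label itself contains a comma, which may not hold on the real column; `n_uses` and `has_parking` are hypothetical feature names.

```python
import pandas as pd

# Values of the kind shown in the table above, plus one missing entry.
uses = pd.Series(
    ["Hotel", "Hotel, Parking, Restaurant", "Hotel, Parking, Swimming Pool", None],
    name="ListOfAllPropertyUseTypes",
)

# Split on ", " (assumed separator) and drop empty fragments from missing values.
use_lists = (
    uses.fillna("")
    .str.split(", ")
    .apply(lambda parts: [p for p in parts if p])
)

n_uses = use_lists.apply(len)                                              # number of listed uses
has_parking = use_lists.apply(lambda parts: "Parking" in parts).astype(int)  # example flag

print(pd.DataFrame({"n_uses": n_uses, "has_parking": has_parking}))
```

The fragility of the separator assumption is one practical reason to prefer already structured variables.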
-> Focus on BuildingType
plt.figure(figsize=(10, 4))
building_df_cleaning["BuildingType"].value_counts().plot(kind="bar")
plt.title("Number of buildings by BuildingType")
plt.ylabel("Number of buildings")
plt.xlabel("BuildingType")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
building_df_cleaning = building_df_cleaning.drop(columns=["BuildingType"],errors="ignore")
building_df_cleaning.shape
(1614, 15)
This variable mainly carries administrative information that is not directly related to the physical characteristics or energy use of the buildings.
Given its strong class imbalance and its redundancy with more relevant variables (such as the building's primary use), BuildingType is dropped for modeling.
At this stage, the dataset contains 1614 buildings and 15 variables.
-> Focus on Neighborhood and CouncilDistrictCode
plt.figure(figsize=(10, 4))
building_df_cleaning["Neighborhood"].value_counts().plot(kind="bar")
plt.title("Number of buildings by neighborhood")
plt.ylabel("Number of buildings")
plt.xlabel("Neighborhood")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 5))
sns.boxplot(
data=building_df_cleaning,
x="Neighborhood",
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"])
)
plt.title("Energy consumption by neighborhood")
plt.ylabel("log(SiteEnergyUse)")
plt.xlabel("Neighborhood")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 5))
sns.boxplot(
data=building_df_cleaning,
x="Neighborhood",
y=np.log1p(building_df_cleaning["TotalGHGEmissions"])
)
plt.title("GHG emissions by neighborhood")
plt.ylabel("log(TotalGHGEmissions)")
plt.xlabel("Neighborhood")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 4))
building_df_cleaning["CouncilDistrictCode"].value_counts().plot(kind="bar")
plt.title("Number of buildings by council district")
plt.ylabel("Number of buildings")
plt.xlabel("CouncilDistrictCode")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 5))
sns.boxplot(
data=building_df_cleaning,
x="CouncilDistrictCode",
y=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"])
)
plt.title("Energy consumption by council district")
plt.ylabel("log(SiteEnergyUse)")
plt.xlabel("CouncilDistrictCode")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
plt.figure(figsize=(12, 5))
sns.boxplot(
data=building_df_cleaning,
x="CouncilDistrictCode",
y=np.log1p(building_df_cleaning["TotalGHGEmissions"])
)
plt.title("GHG emissions by council district")
plt.ylabel("log(TotalGHGEmissions)")
plt.xlabel("CouncilDistrictCode")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
building_df_cleaning = building_df_cleaning.drop(columns=["CouncilDistrictCode"],errors="ignore")
building_df_cleaning.shape
(1614, 14)
The analysis by CouncilDistrictCode shows a more balanced distribution of buildings across categories.
However, the distributions of energy consumption and GHG emissions show strong dispersion within each district and relatively close medians.
This suggests that the signal carried by this variable is limited for explaining the targets.
The analysis by neighborhood highlights differences in the distribution of energy consumption and GHG emissions across geographic areas.
Although some categories are sparsely represented, the Neighborhood variable seems to capture finer urban and structural effects than the administrative districts.
Given the geographic redundancy between Neighborhood and CouncilDistrictCode, and to limit model complexity, only the Neighborhood variable is kept for the rest of the modeling work.
At this stage, the dataset contains 1614 buildings described by 14 variables.
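Since some neighborhoods have few buildings, one option before encoding is to group rare categories under a single label. This is a sketch of that idea, not a step of the original notebook: the `lump_rare` helper, the `min_count` threshold and the "OTHER" label are all assumptions.

```python
import pandas as pd

# Toy column using neighborhood names that appear in the table above.
neigh = pd.Series(
    ["DOWNTOWN"] * 5 + ["GREATER DUWAMISH"] * 3 + ["MAGNOLIA / QUEEN ANNE"],
    name="Neighborhood",
)


def lump_rare(s, min_count=2, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other_label)


lumped = lump_rare(neigh)
print(lumped.value_counts())
```

The threshold would need to be chosen on the real counts, ideally inside the training split only so that the grouping does not depend on the test data.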
Location¶
plt.figure(figsize=(8, 8))
plt.scatter(
building_df_cleaning["Longitude"],
building_df_cleaning["Latitude"],
c=np.log1p(building_df_cleaning["SiteEnergyUse(kBtu)"]),
cmap="viridis",
alpha=0.6
)
plt.colorbar(label="log(SiteEnergyUse)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Spatial distribution of buildings and energy consumption")
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 8))
plt.scatter(
building_df_cleaning["Longitude"],
building_df_cleaning["Latitude"],
c=np.log1p(building_df_cleaning["TotalGHGEmissions"]),
cmap="viridis",
alpha=0.6
)
plt.colorbar(label="log(TotalGHGEmissions)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Spatial distribution of buildings and CO2 emissions")
plt.tight_layout()
plt.show()
building_df_cleaning = building_df_cleaning.drop(columns=["Latitude","Longitude","PropertyName"],errors="ignore")
building_df_cleaning.shape
(1614, 11)
The spatial plots show a strong concentration of buildings in certain areas, notably the city center.
However, within the same geographic area, energy consumption and greenhouse gas emission levels vary widely.
This suggests that location alone is not a sufficient explanatory factor and that building-specific characteristics (use, floor area, age, etc.) play a predominant role.
These visualizations are used here for exploratory and descriptive purposes, to better understand the spatial distribution of the data.
We drop the Latitude, Longitude and PropertyName columns.
Our dataset now contains 1614 buildings described by 11 variables.
building_df_cleaning.dtypes
OSEBuildingID                  int64
Neighborhood                     str
NumberofBuildings            float64
PropertyGFAParking             int64
PropertyGFABuilding(s)         int64
LargestPropertyUseType           str
LargestPropertyUseTypeGFA    float64
SiteEnergyUse(kBtu)          float64
TotalGHGEmissions            float64
has_secondary_use              int64
BuildingAge                    int64
dtype: object
Dataset export¶
building_df_cleaning.to_csv("../data/processed/building_df_cleaned.csv", index=False)
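`to_csv` fails if the target folder does not exist, so it can help to create it first and to reload the file as a quick check. The sketch below writes a two-row toy frame into a temporary directory instead of `../data/processed`; the toy values and the round-trip check are illustrative additions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for building_df_cleaning.
toy = pd.DataFrame({"OSEBuildingID": [1, 2], "BuildingAge": [89, 20]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "building_df_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)                    # read back to verify the export

round_trip_ok = reloaded.equals(toy)
print(reloaded.shape, round_trip_ok)
```

On the real dataset, float columns may not compare exactly equal after a CSV round trip, so a shape and column-name check may be the more robust test.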