projet : ancien énoncé ANALYSEZ LES VENTES DE LA LIBRAIRIE LAPAGE à L'AIDE DE PYTHON¶
Notebook 1 Préparation des données¶
0. Importation des données¶
In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
In [4]:
produits = pd.read_csv('products.csv')
clients = pd.read_csv('customers.csv')
ventes = pd.read_csv('transactions.csv')
In [5]:
produits.head(5)
Out[5]:
| id_prod | price | categ | |
|---|---|---|---|
| 0 | 0_1421 | 19.99 | 0 |
| 1 | 0_1368 | 5.13 | 0 |
| 2 | 0_731 | 17.99 | 0 |
| 3 | 1_587 | 4.99 | 1 |
| 4 | 0_1507 | 3.99 | 0 |
In [6]:
clients.head(5)
Out[6]:
| client_id | sex | birth | |
|---|---|---|---|
| 0 | c_4410 | f | 1967 |
| 1 | c_7839 | f | 1975 |
| 2 | c_1699 | f | 1984 |
| 3 | c_5961 | f | 1962 |
| 4 | c_5320 | m | 1943 |
In [7]:
ventes.head(5)
Out[7]:
| id_prod | date | session_id | client_id | |
|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 |
| 3 | 2_209 | 2021-06-24 04:19:29.835891 | s_52962 | c_6941 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 |
1. Nettoyage des données¶
a. La table "produits"¶
Je vérifie qu'il n'y a pas de données dupliquées
In [8]:
produits.loc[produits['id_prod'].duplicated(keep=False),:]
Out[8]:
| id_prod | price | categ |
|---|
Pas de doublons: on peut utiliser "id_prod" comme une clé
Est-ce qu'il y a des valeurs manquantes ?
In [9]:
produits.isna()
Out[9]:
| id_prod | price | categ | |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 3282 | False | False | False |
| 3283 | False | False | False |
| 3284 | False | False | False |
| 3285 | False | False | False |
| 3286 | False | False | False |
3287 rows × 3 columns
In [10]:
(produits.isna().sum()/produits.shape[0]).sort_values(ascending=True)
Out[10]:
id_prod 0.0 price 0.0 categ 0.0 dtype: float64
b. La table " clients "¶
Je vérifie qu'il n'y a pas de données dupliquées
In [11]:
clients.loc[clients['client_id'].duplicated(keep=False),:]
Out[11]:
| client_id | sex | birth |
|---|
Pas de doublons: on peut utiliser "client_id" comme une clé
Est-ce qu'il y a des valeurs manquantes ?
In [12]:
clients.isna()
Out[12]:
| client_id | sex | birth | |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 8618 | False | False | False |
| 8619 | False | False | False |
| 8620 | False | False | False |
| 8621 | False | False | False |
| 8622 | False | False | False |
8623 rows × 3 columns
In [13]:
(clients.isna().sum()/clients.shape[0]).sort_values(ascending=True)
Out[13]:
client_id 0.0 sex 0.0 birth 0.0 dtype: float64
c. La table "ventes"¶
Je vérifie qu'il n'y a pas de données dupliquées
In [14]:
ventes.loc[ventes['session_id'].duplicated(keep=False),:]
Out[14]:
| id_prod | date | session_id | client_id | |
|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 |
| 5 | 0_1418 | 2022-10-20 15:59:16.084030 | s_285425 | c_1478 |
| ... | ... | ... | ... | ... |
| 679523 | 1_190 | 2022-01-30 06:17:15.265379 | s_157285 | c_656 |
| 679525 | 0_1434 | 2022-12-14 13:18:29.143723 | s_311959 | c_3653 |
| 679529 | 0_1425 | 2022-12-20 04:33:37.584749 | s_314704 | c_304 |
| 679530 | 0_1994 | 2021-07-16 20:36:35.350579 | s_63204 | c_2227 |
| 679531 | 1_523 | 2022-09-28 01:12:01.973763 | s_274568 | c_3873 |
516182 rows × 4 columns
In [15]:
ventes.duplicated().sum()
Out[15]:
126
Est-ce qu'il y a des valeurs manquantes ?
In [16]:
ventes.isna()
Out[16]:
| id_prod | date | session_id | client_id | |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| ... | ... | ... | ... | ... |
| 679527 | False | False | False | False |
| 679528 | False | False | False | False |
| 679529 | False | False | False | False |
| 679530 | False | False | False | False |
| 679531 | False | False | False | False |
679532 rows × 4 columns
In [17]:
(ventes.isna().sum()/ventes.shape[0]).sort_values(ascending=True)
Out[17]:
id_prod 0.0 date 0.0 session_id 0.0 client_id 0.0 dtype: float64
2. Jointure¶
a) Jointure entre les tables ventes et produits¶
In [18]:
ventes_produits= pd.merge(ventes, produits, on= 'id_prod', how='left')
ventes_produits.head()
Out[18]:
| id_prod | date | session_id | client_id | price | categ | |
|---|---|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 | 4.18 | 0.0 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 | 15.99 | 1.0 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 | 7.99 | 0.0 |
| 3 | 2_209 | 2021-06-24 04:19:29.835891 | s_52962 | c_6941 | 69.99 | 2.0 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 | 4.99 | 0.0 |
b) Jointure entre les tables ventes_produits et clients¶
In [19]:
donnees= pd.merge(ventes_produits,clients, on= 'client_id', how='left')
donnees.head()
Out[19]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 | 4.18 | 0.0 | f | 1986 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 | 15.99 | 1.0 | m | 1988 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 | 7.99 | 0.0 | f | 1968 |
| 3 | 2_209 | 2021-06-24 04:19:29.835891 | s_52962 | c_6941 | 69.99 | 2.0 | m | 2000 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 | 4.99 | 0.0 | m | 1980 |
c) Suppression des test "-1"¶
In [20]:
donnees.describe()
Out[20]:
| price | categ | birth | |
|---|---|---|---|
| count | 679311.000000 | 679311.000000 | 679532.000000 |
| mean | 17.449340 | 0.441823 | 1977.817964 |
| std | 18.329034 | 0.595002 | 13.578383 |
| min | -1.000000 | 0.000000 | 1929.000000 |
| 25% | 8.820000 | 0.000000 | 1970.000000 |
| 50% | 13.990000 | 0.000000 | 1980.000000 |
| 75% | 18.990000 | 1.000000 | 1987.000000 |
| max | 300.000000 | 2.000000 | 2004.000000 |
In [21]:
donnees[donnees['price']==-1]
Out[21]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | |
|---|---|---|---|---|---|---|---|---|
| 3019 | T_0 | test_2021-03-01 02:30:02.237419 | s_0 | ct_0 | -1.0 | 0.0 | f | 2001 |
| 5138 | T_0 | test_2021-03-01 02:30:02.237425 | s_0 | ct_0 | -1.0 | 0.0 | f | 2001 |
| 9668 | T_0 | test_2021-03-01 02:30:02.237437 | s_0 | ct_1 | -1.0 | 0.0 | m | 2001 |
| 10728 | T_0 | test_2021-03-01 02:30:02.237436 | s_0 | ct_0 | -1.0 | 0.0 | f | 2001 |
| 15292 | T_0 | test_2021-03-01 02:30:02.237430 | s_0 | ct_0 | -1.0 | 0.0 | f | 2001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 657830 | T_0 | test_2021-03-01 02:30:02.237417 | s_0 | ct_0 | -1.0 | 0.0 | f | 2001 |
| 662081 | T_0 | test_2021-03-01 02:30:02.237427 | s_0 | ct_1 | -1.0 | 0.0 | m | 2001 |
| 670680 | T_0 | test_2021-03-01 02:30:02.237449 | s_0 | ct_1 | -1.0 | 0.0 | m | 2001 |
| 671647 | T_0 | test_2021-03-01 02:30:02.237424 | s_0 | ct_1 | -1.0 | 0.0 | m | 2001 |
| 679180 | T_0 | test_2021-03-01 02:30:02.237425 | s_0 | ct_1 | -1.0 | 0.0 | m | 2001 |
200 rows × 8 columns
In [22]:
donnees2=donnees[donnees['price']!=-1]
donnees2.head()
Out[22]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 | 4.18 | 0.0 | f | 1986 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 | 15.99 | 1.0 | m | 1988 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 | 7.99 | 0.0 | f | 1968 |
| 3 | 2_209 | 2021-06-24 04:19:29.835891 | s_52962 | c_6941 | 69.99 | 2.0 | m | 2000 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 | 4.99 | 0.0 | m | 1980 |
d) Calcul de l'âge¶
In [23]:
donnees2['age']= pd.Timestamp.today().year - donnees['birth']
donnees2.head()
C:\Users\User\AppData\Local\Temp\ipykernel_13472\3288089109.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy donnees2['age']= pd.Timestamp.today().year - donnees['birth']
Out[23]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | age | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_1518 | 2022-05-20 13:21:29.043970 | s_211425 | c_103 | 4.18 | 0.0 | f | 1986 | 37 |
| 1 | 1_251 | 2022-02-02 07:55:19.149409 | s_158752 | c_8534 | 15.99 | 1.0 | m | 1988 | 35 |
| 2 | 0_1277 | 2022-06-18 15:44:33.155329 | s_225667 | c_6714 | 7.99 | 0.0 | f | 1968 | 55 |
| 3 | 2_209 | 2021-06-24 04:19:29.835891 | s_52962 | c_6941 | 69.99 | 2.0 | m | 2000 | 23 |
| 4 | 0_1509 | 2023-01-11 08:22:08.194479 | s_325227 | c_4232 | 4.99 | 0.0 | m | 1980 | 43 |
e) Traitement des " NAN "¶
In [25]:
donnees2['price'].info()
<class 'pandas.core.series.Series'> Int64Index: 679332 entries, 0 to 679531 Series name: price Non-Null Count Dtype -------------- ----- 679111 non-null float64 dtypes: float64(1) memory usage: 10.4 MB
In [26]:
donnees2.isnull().sum()
Out[26]:
id_prod 0 date 0 session_id 0 client_id 0 price 221 categ 221 sex 0 birth 0 age 0 dtype: int64
In [27]:
donnees2[donnees2['categ'].isnull()]
#faire un sort values sur id_prod
Out[27]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | age | |
|---|---|---|---|---|---|---|---|---|---|
| 2633 | 0_2245 | 2022-09-23 07:22:38.636773 | s_272266 | c_4746 | NaN | NaN | m | 1940 | 83 |
| 10106 | 0_2245 | 2022-07-23 09:24:14.133889 | s_242482 | c_6713 | NaN | NaN | f | 1963 | 60 |
| 11727 | 0_2245 | 2022-12-03 03:26:35.696673 | s_306338 | c_5108 | NaN | NaN | m | 1978 | 45 |
| 15675 | 0_2245 | 2021-08-16 11:33:25.481411 | s_76493 | c_1391 | NaN | NaN | m | 1991 | 32 |
| 16377 | 0_2245 | 2022-07-16 05:53:01.627491 | s_239078 | c_7954 | NaN | NaN | m | 1973 | 50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 669730 | 0_2245 | 2021-08-25 09:06:03.504061 | s_80395 | c_131 | NaN | NaN | m | 1981 | 42 |
| 670682 | 0_2245 | 2022-03-06 19:59:19.462288 | s_175311 | c_4167 | NaN | NaN | f | 1979 | 44 |
| 671286 | 0_2245 | 2022-05-16 11:35:20.319501 | s_209381 | c_4453 | NaN | NaN | m | 1981 | 42 |
| 675679 | 0_2245 | 2022-02-11 09:05:43.952857 | s_163405 | c_1098 | NaN | NaN | m | 1986 | 37 |
| 677996 | 0_2245 | 2021-12-14 22:34:54.589921 | s_134446 | c_4854 | NaN | NaN | m | 1968 | 55 |
221 rows × 9 columns
In [28]:
donnees2= donnees2.fillna({'categ':0.0})
In [29]:
donnees2.isnull().sum()
Out[29]:
id_prod 0 date 0 session_id 0 client_id 0 price 221 categ 0 sex 0 birth 0 age 0 dtype: int64
In [30]:
donnees2[donnees2['price'].isnull()]
Out[30]:
| id_prod | date | session_id | client_id | price | categ | sex | birth | age | |
|---|---|---|---|---|---|---|---|---|---|
| 2633 | 0_2245 | 2022-09-23 07:22:38.636773 | s_272266 | c_4746 | NaN | 0.0 | m | 1940 | 83 |
| 10106 | 0_2245 | 2022-07-23 09:24:14.133889 | s_242482 | c_6713 | NaN | 0.0 | f | 1963 | 60 |
| 11727 | 0_2245 | 2022-12-03 03:26:35.696673 | s_306338 | c_5108 | NaN | 0.0 | m | 1978 | 45 |
| 15675 | 0_2245 | 2021-08-16 11:33:25.481411 | s_76493 | c_1391 | NaN | 0.0 | m | 1991 | 32 |
| 16377 | 0_2245 | 2022-07-16 05:53:01.627491 | s_239078 | c_7954 | NaN | 0.0 | m | 1973 | 50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 669730 | 0_2245 | 2021-08-25 09:06:03.504061 | s_80395 | c_131 | NaN | 0.0 | m | 1981 | 42 |
| 670682 | 0_2245 | 2022-03-06 19:59:19.462288 | s_175311 | c_4167 | NaN | 0.0 | f | 1979 | 44 |
| 671286 | 0_2245 | 2022-05-16 11:35:20.319501 | s_209381 | c_4453 | NaN | 0.0 | m | 1981 | 42 |
| 675679 | 0_2245 | 2022-02-11 09:05:43.952857 | s_163405 | c_1098 | NaN | 0.0 | m | 1986 | 37 |
| 677996 | 0_2245 | 2021-12-14 22:34:54.589921 | s_134446 | c_4854 | NaN | 0.0 | m | 1968 | 55 |
221 rows × 9 columns
In [31]:
donnees2[donnees2['categ']==0]['price'].median()
Out[31]:
9.99
In [32]:
donnees2= donnees2.fillna({'price':donnees2[donnees2['categ']==0]['price'].median()})
In [33]:
donnees2.isnull().sum()
Out[33]:
id_prod 0 date 0 session_id 0 client_id 0 price 0 categ 0 sex 0 birth 0 age 0 dtype: int64
3. Exportation de donnees2¶
In [34]:
donnees2_csv_data = donnees2.to_csv('donnees2.csv',index=False)
In [ ]: