projet : ancien énoncé ANALYSEZ LES VENTES DE LA LIBRAIRIE LAPAGE à L'AIDE DE PYTHON¶

Notebook 1 Préparation des données¶

0. Importation des données¶

In [3]:
import pandas as pd

import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
from scipy import stats
In [4]:
produits = pd.read_csv('products.csv')
clients = pd.read_csv('customers.csv')
ventes = pd.read_csv('transactions.csv')
In [5]:
produits.head(5)
Out[5]:
id_prod price categ
0 0_1421 19.99 0
1 0_1368 5.13 0
2 0_731 17.99 0
3 1_587 4.99 1
4 0_1507 3.99 0
In [6]:
clients.head(5)
Out[6]:
client_id sex birth
0 c_4410 f 1967
1 c_7839 f 1975
2 c_1699 f 1984
3 c_5961 f 1962
4 c_5320 m 1943
In [7]:
ventes.head(5)
Out[7]:
id_prod date session_id client_id
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714
3 2_209 2021-06-24 04:19:29.835891 s_52962 c_6941
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232

1. Nettoyage des données¶

a. La table "produits"¶

Je vérifie qu'il n'y a pas de données dupliquées

In [8]:
produits.loc[produits['id_prod'].duplicated(keep=False),:]
Out[8]:
id_prod price categ

Pas de doublons: on peut utiliser "id_prod" comme une clé

Est-ce qu'il y a des valeurs manquantes ?

In [9]:
produits.isna()
Out[9]:
id_prod price categ
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
3282 False False False
3283 False False False
3284 False False False
3285 False False False
3286 False False False

3287 rows × 3 columns

In [10]:
(produits.isna().sum()/produits.shape[0]).sort_values(ascending=True)
Out[10]:
id_prod    0.0
price      0.0
categ      0.0
dtype: float64

b. La table " clients "¶

Je vérifie qu'il n'y a pas de données dupliquées

In [11]:
clients.loc[clients['client_id'].duplicated(keep=False),:]
Out[11]:
client_id sex birth

Pas de doublons: on peut utiliser "client_id" comme une clé

Est-ce qu'il y a des valeurs manquantes ?

In [12]:
clients.isna()
Out[12]:
client_id sex birth
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
8618 False False False
8619 False False False
8620 False False False
8621 False False False
8622 False False False

8623 rows × 3 columns

In [13]:
(clients.isna().sum()/clients.shape[0]).sort_values(ascending=True)
Out[13]:
client_id    0.0
sex          0.0
birth        0.0
dtype: float64

c. La table "ventes"¶

Je vérifie qu'il n'y a pas de données dupliquées

In [14]:
ventes.loc[ventes['session_id'].duplicated(keep=False),:]
Out[14]:
id_prod date session_id client_id
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232
5 0_1418 2022-10-20 15:59:16.084030 s_285425 c_1478
... ... ... ... ...
679523 1_190 2022-01-30 06:17:15.265379 s_157285 c_656
679525 0_1434 2022-12-14 13:18:29.143723 s_311959 c_3653
679529 0_1425 2022-12-20 04:33:37.584749 s_314704 c_304
679530 0_1994 2021-07-16 20:36:35.350579 s_63204 c_2227
679531 1_523 2022-09-28 01:12:01.973763 s_274568 c_3873

516182 rows × 4 columns

In [15]:
ventes.duplicated().sum()
Out[15]:
126

Est-ce qu'il y a des valeurs manquantes ?

In [16]:
ventes.isna()
Out[16]:
id_prod date session_id client_id
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
... ... ... ... ...
679527 False False False False
679528 False False False False
679529 False False False False
679530 False False False False
679531 False False False False

679532 rows × 4 columns

In [17]:
(ventes.isna().sum()/ventes.shape[0]).sort_values(ascending=True)
Out[17]:
id_prod       0.0
date          0.0
session_id    0.0
client_id     0.0
dtype: float64

2. Jointure¶

a) Jointure entre les tables ventes et produits¶

In [18]:
ventes_produits= pd.merge(ventes, produits, on= 'id_prod', how='left')
ventes_produits.head()
Out[18]:
id_prod date session_id client_id price categ
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103 4.18 0.0
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534 15.99 1.0
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714 7.99 0.0
3 2_209 2021-06-24 04:19:29.835891 s_52962 c_6941 69.99 2.0
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232 4.99 0.0

b) Jointure entre les tables ventes_produits et clients¶

In [19]:
donnees= pd.merge(ventes_produits,clients, on= 'client_id', how='left')
donnees.head()
Out[19]:
id_prod date session_id client_id price categ sex birth
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103 4.18 0.0 f 1986
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534 15.99 1.0 m 1988
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714 7.99 0.0 f 1968
3 2_209 2021-06-24 04:19:29.835891 s_52962 c_6941 69.99 2.0 m 2000
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232 4.99 0.0 m 1980

c) Suppression des test "-1"¶

In [20]:
donnees.describe()
Out[20]:
price categ birth
count 679311.000000 679311.000000 679532.000000
mean 17.449340 0.441823 1977.817964
std 18.329034 0.595002 13.578383
min -1.000000 0.000000 1929.000000
25% 8.820000 0.000000 1970.000000
50% 13.990000 0.000000 1980.000000
75% 18.990000 1.000000 1987.000000
max 300.000000 2.000000 2004.000000
In [21]:
donnees[donnees['price']==-1]
Out[21]:
id_prod date session_id client_id price categ sex birth
3019 T_0 test_2021-03-01 02:30:02.237419 s_0 ct_0 -1.0 0.0 f 2001
5138 T_0 test_2021-03-01 02:30:02.237425 s_0 ct_0 -1.0 0.0 f 2001
9668 T_0 test_2021-03-01 02:30:02.237437 s_0 ct_1 -1.0 0.0 m 2001
10728 T_0 test_2021-03-01 02:30:02.237436 s_0 ct_0 -1.0 0.0 f 2001
15292 T_0 test_2021-03-01 02:30:02.237430 s_0 ct_0 -1.0 0.0 f 2001
... ... ... ... ... ... ... ... ...
657830 T_0 test_2021-03-01 02:30:02.237417 s_0 ct_0 -1.0 0.0 f 2001
662081 T_0 test_2021-03-01 02:30:02.237427 s_0 ct_1 -1.0 0.0 m 2001
670680 T_0 test_2021-03-01 02:30:02.237449 s_0 ct_1 -1.0 0.0 m 2001
671647 T_0 test_2021-03-01 02:30:02.237424 s_0 ct_1 -1.0 0.0 m 2001
679180 T_0 test_2021-03-01 02:30:02.237425 s_0 ct_1 -1.0 0.0 m 2001

200 rows × 8 columns

In [22]:
donnees2=donnees[donnees['price']!=-1]
donnees2.head()
Out[22]:
id_prod date session_id client_id price categ sex birth
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103 4.18 0.0 f 1986
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534 15.99 1.0 m 1988
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714 7.99 0.0 f 1968
3 2_209 2021-06-24 04:19:29.835891 s_52962 c_6941 69.99 2.0 m 2000
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232 4.99 0.0 m 1980

d) Calcul de l'âge¶

In [23]:
donnees2['age']= pd.Timestamp.today().year - donnees['birth']
donnees2.head()
C:\Users\User\AppData\Local\Temp\ipykernel_13472\3288089109.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  donnees2['age']= pd.Timestamp.today().year - donnees['birth']
Out[23]:
id_prod date session_id client_id price categ sex birth age
0 0_1518 2022-05-20 13:21:29.043970 s_211425 c_103 4.18 0.0 f 1986 37
1 1_251 2022-02-02 07:55:19.149409 s_158752 c_8534 15.99 1.0 m 1988 35
2 0_1277 2022-06-18 15:44:33.155329 s_225667 c_6714 7.99 0.0 f 1968 55
3 2_209 2021-06-24 04:19:29.835891 s_52962 c_6941 69.99 2.0 m 2000 23
4 0_1509 2023-01-11 08:22:08.194479 s_325227 c_4232 4.99 0.0 m 1980 43

e) Traitement des " NAN "¶

In [25]:
donnees2['price'].info()
<class 'pandas.core.series.Series'>
Int64Index: 679332 entries, 0 to 679531
Series name: price
Non-Null Count   Dtype  
--------------   -----  
679111 non-null  float64
dtypes: float64(1)
memory usage: 10.4 MB
In [26]:
donnees2.isnull().sum()
Out[26]:
id_prod         0
date            0
session_id      0
client_id       0
price         221
categ         221
sex             0
birth           0
age             0
dtype: int64
In [27]:
donnees2[donnees2['categ'].isnull()] 
#faire un sort values sur id_prod 
Out[27]:
id_prod date session_id client_id price categ sex birth age
2633 0_2245 2022-09-23 07:22:38.636773 s_272266 c_4746 NaN NaN m 1940 83
10106 0_2245 2022-07-23 09:24:14.133889 s_242482 c_6713 NaN NaN f 1963 60
11727 0_2245 2022-12-03 03:26:35.696673 s_306338 c_5108 NaN NaN m 1978 45
15675 0_2245 2021-08-16 11:33:25.481411 s_76493 c_1391 NaN NaN m 1991 32
16377 0_2245 2022-07-16 05:53:01.627491 s_239078 c_7954 NaN NaN m 1973 50
... ... ... ... ... ... ... ... ... ...
669730 0_2245 2021-08-25 09:06:03.504061 s_80395 c_131 NaN NaN m 1981 42
670682 0_2245 2022-03-06 19:59:19.462288 s_175311 c_4167 NaN NaN f 1979 44
671286 0_2245 2022-05-16 11:35:20.319501 s_209381 c_4453 NaN NaN m 1981 42
675679 0_2245 2022-02-11 09:05:43.952857 s_163405 c_1098 NaN NaN m 1986 37
677996 0_2245 2021-12-14 22:34:54.589921 s_134446 c_4854 NaN NaN m 1968 55

221 rows × 9 columns

In [28]:
donnees2= donnees2.fillna({'categ':0.0})
In [29]:
donnees2.isnull().sum()
Out[29]:
id_prod         0
date            0
session_id      0
client_id       0
price         221
categ           0
sex             0
birth           0
age             0
dtype: int64
In [30]:
donnees2[donnees2['price'].isnull()]
Out[30]:
id_prod date session_id client_id price categ sex birth age
2633 0_2245 2022-09-23 07:22:38.636773 s_272266 c_4746 NaN 0.0 m 1940 83
10106 0_2245 2022-07-23 09:24:14.133889 s_242482 c_6713 NaN 0.0 f 1963 60
11727 0_2245 2022-12-03 03:26:35.696673 s_306338 c_5108 NaN 0.0 m 1978 45
15675 0_2245 2021-08-16 11:33:25.481411 s_76493 c_1391 NaN 0.0 m 1991 32
16377 0_2245 2022-07-16 05:53:01.627491 s_239078 c_7954 NaN 0.0 m 1973 50
... ... ... ... ... ... ... ... ... ...
669730 0_2245 2021-08-25 09:06:03.504061 s_80395 c_131 NaN 0.0 m 1981 42
670682 0_2245 2022-03-06 19:59:19.462288 s_175311 c_4167 NaN 0.0 f 1979 44
671286 0_2245 2022-05-16 11:35:20.319501 s_209381 c_4453 NaN 0.0 m 1981 42
675679 0_2245 2022-02-11 09:05:43.952857 s_163405 c_1098 NaN 0.0 m 1986 37
677996 0_2245 2021-12-14 22:34:54.589921 s_134446 c_4854 NaN 0.0 m 1968 55

221 rows × 9 columns

In [31]:
donnees2[donnees2['categ']==0]['price'].median()
Out[31]:
9.99
In [32]:
donnees2= donnees2.fillna({'price':donnees2[donnees2['categ']==0]['price'].median()})
In [33]:
donnees2.isnull().sum()
Out[33]:
id_prod       0
date          0
session_id    0
client_id     0
price         0
categ         0
sex           0
birth         0
age           0
dtype: int64

3. Exportation de donnees2¶

In [34]:
donnees2_csv_data = donnees2.to_csv('donnees2.csv',index=False)
In [ ]: