8  Data management and analysis

8.1 Introduction

8.1.1 Prerequisites

To work with data and all the functionality covered in this chapter, we will use two external packages, numpy and pandas, plus the os module from Python's standard library:

  • numpy
  • pandas
  • os
import numpy as np
import pandas as pd
import os

8.2 What is Pandas?

Pandas is a library containing specialized objects, and tools built around them, that make it possible to clean, organize, and analyze data very efficiently in Python. Not by chance, it is one of the most widely used libraries in the language today, a prerequisite for most data science and machine learning applications.

Although Pandas draws heavily on the structure and logic of NumPy and its arrays, the library is designed to work with tabular, heterogeneous data. As we have already seen, NumPy is better suited to homogeneous, mostly numeric data.

Pandas has two objects of its own: Series and DataFrame.

You can think of a Series as a “column” of data: a sequence of observations on a single variable. A DataFrame, in turn, is a two-dimensional object for storing related columns of data, much like a table with rows and columns in Microsoft Excel.

8.3 Objects in Pandas

Let's start by creating two lists: one containing the number of employed people (in thousands) in Brazil from 2012 to 2021, and another containing exactly that range of years.

pessoas_ocupadas = [90593, 92170, 92962, 92366, 90174, 92228, 93534, 95515, 87225, 95747]
anos = list(range(2012, 2021 + 1))

print(pessoas_ocupadas)
print(anos)
[90593, 92170, 92962, 92366, 90174, 92228, 93534, 95515, 87225, 95747]
[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

To create a Series, just use the pandas.Series constructor, which takes as arguments the values of the series and its index, the labels that identify each row.

series_pessoas_ocup = pd.Series(data=pessoas_ocupadas,index=anos)

print(type(series_pessoas_ocup),'\n')
print(series_pessoas_ocup)
<class 'pandas.core.series.Series'> 

2012    90593
2013    92170
2014    92962
2015    92366
2016    90174
2017    92228
2018    93534
2019    95515
2020    87225
2021    95747
dtype: int64
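The index labels let us retrieve values much like dictionary keys, and label-based slices with .loc include both endpoints. A minimal sketch, rebuilding the same series as above:

```python
import pandas as pd

pessoas_ocupadas = [90593, 92170, 92962, 92366, 90174,
                    92228, 93534, 95515, 87225, 95747]
anos = list(range(2012, 2021 + 1))
series_pessoas_ocup = pd.Series(data=pessoas_ocupadas, index=anos)

# Access a single observation by its index label (the year)
print(series_pessoas_ocup.loc[2020])

# Label-based slicing with .loc includes BOTH endpoints
print(series_pessoas_ocup.loc[2019:2021])
```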

For a DataFrame, we can use the pandas.DataFrame constructor. It accepts the data that will make up the columns in several forms, but one of the most intuitive is a dictionary. Following the previous example, let's create a DataFrame with the number of people who are employed, unemployed, in the labor force, and out of the labor force.

The first step is to define a dictionary with these data.

dados = {
    'ocupadas': [90593,92170,92962,92366,90174,92228,93534,95515,87225,95747],
    'desocupadas': [6730,6151,6555,9222,12476,12453,12413,11903,14412,12011],
    'na forca': [97322,98321,99516,101588,102650,104682,105947,107418,101637,107758],
    'fora da forca': [58007,59244,60162,60092,60953,60777,61299,61579,69042,64525]
}

print(dados)
{'ocupadas': [90593, 92170, 92962, 92366, 90174, 92228, 93534, 95515, 87225, 95747], 'desocupadas': [6730, 6151, 6555, 9222, 12476, 12453, 12413, 11903, 14412, 12011], 'na forca': [97322, 98321, 99516, 101588, 102650, 104682, 105947, 107418, 101637, 107758], 'fora da forca': [58007, 59244, 60162, 60092, 60953, 60777, 61299, 61579, 69042, 64525]}

Now we just define the DataFrame, passing this dados dictionary as the argument.

dados_emprego = pd.DataFrame(data=dados,index=anos)

print(type(dados_emprego),'\n')
dados_emprego
<class 'pandas.core.frame.DataFrame'> 
ocupadas desocupadas na forca fora da forca
2012 90593 6730 97322 58007
2013 92170 6151 98321 59244
2014 92962 6555 99516 60162
2015 92366 9222 101588 60092
2016 90174 12476 102650 60953
2017 92228 12453 104682 60777
2018 93534 12413 105947 61299
2019 95515 11903 107418 61579
2020 87225 14412 101637 69042
2021 95747 12011 107758 64525
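Note that the two objects are tightly connected: each column of a DataFrame is itself a Series that shares the DataFrame's index. A minimal sketch, rebuilding a small piece of the dados_emprego frame above:

```python
import pandas as pd

anos = list(range(2012, 2021 + 1))
dados = {
    'ocupadas': [90593, 92170, 92962, 92366, 90174,
                 92228, 93534, 95515, 87225, 95747],
    'desocupadas': [6730, 6151, 6555, 9222, 12476,
                    12453, 12413, 11903, 14412, 12011],
}
dados_emprego = pd.DataFrame(data=dados, index=anos)

# Selecting one column returns a Series that keeps the year index
col = dados_emprego['ocupadas']
print(type(col))
print(col.loc[2021])
```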

8.4 DataFrames: the heart of Pandas

8.4.1 Importing external data

Now that we have introduced both types of objects and how to create them by hand, let's move on and work with data brought in from an external file. Files of type .csv are the ones we encounter most often, and it is with one of them that we will work in Pandas, although Pandas is flexible and lets us bring files with many other extensions into Python, such as .txt, .xlsx, .json, .dta, etc.
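Each of those extensions has a corresponding pd.read_* function. A minimal, self-contained sketch, using io.StringIO to stand in for files on disk (the file names in the comments are placeholders, not real files):

```python
import io
import pandas as pd

# io.StringIO stands in for a file on disk; in practice you would pass a path
csv_text = "ano,valor\n2020,10\n2021,12\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

json_text = '[{"ano": 2020, "valor": 10}, {"ano": 2021, "valor": 12}]'
df_from_json = pd.read_json(io.StringIO(json_text))

# Analogous readers exist for the other extensions mentioned:
# pd.read_excel('arquivo.xlsx'), pd.read_stata('arquivo.dta'),
# pd.read_csv('arquivo.txt', sep='\t') for tab-separated text files
print(df_from_csv.shape, df_from_json.shape)
```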

From here on, the idea is to work with data from the Our World in Data website on the number of covid-19 cases, deaths and excess deaths attributed to the disease, and the number of vaccine doses administered around the world since the start of the pandemic, in February 2020. These data are updated weekly and made available on the Kaggle platform. The specific file we will work with is available on Moodle and should be downloaded to your local machine.

But before learning how to work with a DataFrame and the functionality Pandas offers, we need to read the file containing our data. Let's begin with the os.chdir function from the os package, which changes Python's working directory to the directory where our file is located:

import os

# Which directory is Python working in? The getcwd() function tells us
print('Diretório inicial: {}'.format(os.getcwd()))

# Change to the directory that contains our data file
os.chdir('D:/Dropbox/Projetos e trabalhos/FEAUSP/Cursos/EAE1106 - Metodos Computacionais para Economistas/Aulas/2024_2')
print('Diretório final: {}'.format(os.getcwd()))
Diretório inicial: D:\Dropbox\Projetos e trabalhos\FEAUSP\Cursos\EAE1106 - Metodos Computacionais para Economistas\Book\eae1106
Diretório final: D:\Dropbox\Projetos e trabalhos\FEAUSP\Cursos\EAE1106 - Metodos Computacionais para Economistas\Aulas\2024_2

Which files are available in this directory? The os.listdir() function lists the folders and files in the working directory.

os.listdir()
['df_resultado_aula_pandas.csv',
 'EAE1106 - Aula Fundamentos.html',
 'EAE1106 - Aula Fundamentos.pdf',
 'EAE1106 - Aula Funções.html',
 'EAE1106 - Aula Funções.pdf',
 'EAE1106 - Aula Instalação.html',
 'EAE1106 - Aula Instalação.pdf',
 'EAE1106 - Aula Iteração.html',
 'EAE1106 - Aula Iteração.pdf',
 'EAE1106 - Aula Matplotlib.html',
 'EAE1106 - Aula Matplotlib.pdf',
 'EAE1106 - Aula NumPy.html',
 'EAE1106 - Aula NumPy.pdf',
 'EAE1106 - Aula Objetos.html',
 'EAE1106 - Aula Objetos.pdf',
 'EAE1106 - Aula Pandas.html',
 'EAE1106 - Aula Pandas.pdf',
 'EAE1106_Fundamentos_slides.aux',
 'EAE1106_Fundamentos_slides.log',
 'EAE1106_Fundamentos_slides.nav',
 'EAE1106_Fundamentos_slides.out',
 'EAE1106_Fundamentos_slides.pdf',
 'EAE1106_Fundamentos_slides.snm',
 'EAE1106_Fundamentos_slides.synctex.gz',
 'EAE1106_Fundamentos_slides.tex',
 'EAE1106_Fundamentos_slides.toc',
 'EAE1106_Introducao_slides.aux',
 'EAE1106_Introducao_slides.log',
 'EAE1106_Introducao_slides.nav',
 'EAE1106_Introducao_slides.out',
 'EAE1106_Introducao_slides.pdf',
 'EAE1106_Introducao_slides.snm',
 'EAE1106_Introducao_slides.synctex.gz',
 'EAE1106_Introducao_slides.tex',
 'EAE1106_Introducao_slides.toc',
 'notebook_images',
 'owid-covid-data_top10.csv']

Great. Now we will use the pandas.read_csv function to read the file owid-covid-data_top10.csv. This file contains the information for the 10 countries with the highest reported number of covid-19 cases.

df_covid = pd.read_csv('owid-covid-data_top10.csv', sep=',', encoding='utf8')
type(df_covid)
pandas.core.frame.DataFrame

8.4.2 Basic characteristics

Some methods specific to the DataFrame object are incredibly useful when we want a quick read of what we have just created.

  • df.info() returns technical information about the DataFrame df, such as the data type of each column.
  • df.head(n) and df.tail(n) return the first n and last n rows, respectively.
  • df.describe() returns a table of descriptive statistics for the numeric columns.

Let's see what these methods return in our case:

df_covid.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7663 entries, 0 to 7662
Data columns (total 67 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    7663 non-null   object 
 1   continent                                   7663 non-null   object 
 2   location                                    7663 non-null   object 
 3   date                                        7663 non-null   object 
 4   total_cases                                 7663 non-null   float64
 5   new_cases                                   7641 non-null   float64
 6   new_cases_smoothed                          7484 non-null   float64
 7   total_deaths                                7345 non-null   float64
 8   new_deaths                                  7327 non-null   float64
 9   new_deaths_smoothed                         7173 non-null   float64
 10  total_cases_per_million                     7663 non-null   float64
 11  new_cases_per_million                       7641 non-null   float64
 12  new_cases_smoothed_per_million              7484 non-null   float64
 13  total_deaths_per_million                    7345 non-null   float64
 14  new_deaths_per_million                      7327 non-null   float64
 15  new_deaths_smoothed_per_million             7173 non-null   float64
 16  reproduction_rate                           7292 non-null   float64
 17  icu_patients                                4278 non-null   float64
 18  icu_patients_per_million                    4278 non-null   float64
 19  hosp_patients                               2861 non-null   float64
 20  hosp_patients_per_million                   2861 non-null   float64
 21  weekly_icu_admissions                       1377 non-null   float64
 22  weekly_icu_admissions_per_million           1377 non-null   float64
 23  weekly_hosp_admissions                      3603 non-null   float64
 24  weekly_hosp_admissions_per_million          3603 non-null   float64
 25  new_tests                                   5755 non-null   float64
 26  total_tests                                 6056 non-null   float64
 27  total_tests_per_thousand                    6056 non-null   float64
 28  new_tests_per_thousand                      5755 non-null   float64
 29  new_tests_smoothed                          7116 non-null   float64
 30  new_tests_smoothed_per_thousand             7116 non-null   float64
 31  positive_rate                               5812 non-null   float64
 32  tests_per_case                              5812 non-null   float64
 33  tests_units                                 7186 non-null   object 
 34  total_vaccinations                          3950 non-null   float64
 35  people_vaccinated                           3938 non-null   float64
 36  people_fully_vaccinated                     3860 non-null   float64
 37  total_boosters                              2070 non-null   float64
 38  new_vaccinations                            3858 non-null   float64
 39  new_vaccinations_smoothed                   4181 non-null   float64
 40  total_vaccinations_per_hundred              3950 non-null   float64
 41  people_vaccinated_per_hundred               3938 non-null   float64
 42  people_fully_vaccinated_per_hundred         3860 non-null   float64
 43  total_boosters_per_hundred                  2070 non-null   float64
 44  new_vaccinations_smoothed_per_million       4181 non-null   float64
 45  new_people_vaccinated_smoothed              4181 non-null   float64
 46  new_people_vaccinated_smoothed_per_hundred  4181 non-null   float64
 47  stringency_index                            7580 non-null   float64
 48  population                                  7663 non-null   float64
 49  population_density                          7663 non-null   float64
 50  median_age                                  7663 non-null   float64
 51  aged_65_older                               7663 non-null   float64
 52  aged_70_older                               7663 non-null   float64
 53  gdp_per_capita                              7663 non-null   float64
 54  extreme_poverty                             5348 non-null   float64
 55  cardiovasc_death_rate                       7663 non-null   float64
 56  diabetes_prevalence                         7663 non-null   float64
 57  female_smokers                              7663 non-null   float64
 58  male_smokers                                7663 non-null   float64
 59  handwashing_facilities                      766 non-null    float64
 60  hospital_beds_per_thousand                  7663 non-null   float64
 61  life_expectancy                             7663 non-null   float64
 62  human_development_index                     7663 non-null   float64
 63  excess_mortality_cumulative_absolute        692 non-null    float64
 64  excess_mortality_cumulative                 692 non-null    float64
 65  excess_mortality                            692 non-null    float64
 66  excess_mortality_cumulative_per_million     692 non-null    float64
dtypes: float64(62), object(5)
memory usage: 5.7 MB
df_covid.head(2) # method that shows only the first two rows of the dataframe
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 BRA South America Brazil 2020-02-26 1.0 1.0 NaN NaN NaN NaN ... 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
1 BRA South America Brazil 2020-02-27 1.0 0.0 NaN NaN NaN NaN ... 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN

2 rows × 67 columns

Wait, aren't there 67 columns? Where are they all? By default pandas displays only a fraction of the total. To change this behavior, use the pd.set_option function and adjust the option that controls the maximum number of columns displayed.

pd.set_option('display.max_columns', 100)
df_covid.head(2)
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 BRA South America Brazil 2020-02-26 1.0 1.0 NaN NaN NaN NaN 0.005 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
1 BRA South America Brazil 2020-02-27 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
df_covid.describe()
total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
count 7.663000e+03 7.641000e+03 7484.000000 7345.000000 7327.000000 7173.000000 7663.000000 7641.000000 7484.000000 7345.000000 7327.000000 7173.000000 7292.00000 4278.000000 4278.000000 2861.000000 2861.000000 1377.000000 1377.000000 3603.000000 3603.000000 5.755000e+03 6.056000e+03 6056.000000 5755.000000 7.116000e+03 7116.000000 5812.000000 5812.000000 3.950000e+03 3.938000e+03 3.860000e+03 2.070000e+03 3.858000e+03 4.181000e+03 3950.000000 3938.000000 3860.000000 2070.000000 4181.000000 4.181000e+03 4181.000000 7580.000000 7.663000e+03 7663.000000 7663.000000 7663.000000 7663.000000 7663.000000 5348.000000 7663.000000 7663.000000 7663.000000 7663.000000 7.660000e+02 7663.000000 7663.000000 7663.000000 6.920000e+02 692.000000 692.000000 692.000000
mean 7.413453e+06 3.248569e+04 32590.349244 143683.851464 420.451071 423.776663 40129.002883 218.335670 215.706487 849.127797 2.393236 2.408332 1.10120 3303.139551 24.102691 22806.165676 194.417561 1051.387073 15.427283 14677.353039 104.734778 4.813446e+05 1.227599e+08 770.611446 3.373291 4.273001e+05 3.038345 0.057080 56.352392 1.733269e+08 9.999566e+07 6.865947e+07 1.426021e+07 9.534017e+05 9.135433e+05 87.762058 45.944375 36.591272 13.227155 4397.336283 4.408262e+05 0.183069 58.892639 2.540772e+08 224.299485 40.885972 16.819120 11.279442 33434.062213 4.037939 168.227080 7.011014 17.022224 31.774449 5.955000e+01 5.882298 79.370521 0.867278 1.277382e+05 7.168382 10.014639 868.361431
std 1.200763e+07 6.990790e+04 66130.926542 193260.956591 656.855964 611.463015 53182.233431 514.898513 475.288902 856.364728 3.168872 2.914506 0.39357 5453.474130 22.399457 26308.573419 149.807859 842.146810 12.298781 24168.937379 98.726733 5.613885e+05 1.801855e+08 1131.024910 4.573386 5.000443e+05 3.982051 0.063095 96.826696 2.828810e+08 1.715422e+08 1.136508e+08 2.068579e+07 1.653290e+06 1.453322e+06 65.615643 28.826050 28.475631 18.788789 3093.759435 8.317038e+05 0.181709 17.688749 3.887761e+08 171.340671 6.045539 6.145592 4.649445 13596.356671 7.104167 104.592235 2.204401 8.869313 11.113599 4.550445e-13 4.150498 4.790136 0.090695 2.181352e+05 7.737350 16.272627 1006.519309
min 1.000000e+00 0.000000e+00 0.000000 1.000000 0.000000 0.000000 0.001000 0.000000 0.000000 0.001000 0.000000 0.000000 0.19000 8.000000 0.156000 127.000000 2.104000 34.000000 0.563000 24.000000 0.286000 4.000000e+00 4.000000e+00 0.000000 0.000000 9.000000e+00 0.000000 0.000500 2.000000 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 4.440000e+02 0.000000 0.000000 0.000000 0.000000 4.000000 2.390000e+02 0.000000 0.000000 5.130518e+07 8.823000 28.200000 5.989000 3.414000 6426.674000 0.100000 79.370000 4.280000 1.900000 17.900000 5.955000e+01 0.530000 69.660000 0.645000 -3.772610e+04 -9.020000 -19.200000 -299.292834
25% 2.297820e+05 1.333000e+03 1440.107250 10145.000000 23.000000 27.857000 2831.535000 12.022000 12.578000 98.069000 0.214000 0.242000 0.90000 348.000000 5.410000 6009.000000 78.055000 323.000000 4.903000 1932.500000 28.734000 6.352250e+04 8.584362e+06 75.098750 0.529000 6.720050e+04 0.548000 0.016375 14.300000 2.813928e+07 2.034653e+07 8.608600e+06 1.092000e+04 1.893440e+05 2.088040e+05 22.652500 15.930000 6.700000 0.012500 1946.000000 4.624100e+04 0.050000 47.690000 6.742200e+07 35.608000 38.300000 13.914000 8.622000 24765.954000 0.200000 86.060000 4.780000 10.100000 24.600000 5.955000e+01 2.540000 75.880000 0.824000 -1.335500e+02 -0.052500 -0.252500 -2.603051
50% 2.702681e+06 9.921000e+03 11149.928500 83976.000000 161.000000 177.714000 17491.567000 62.362000 67.154500 554.520000 0.977000 1.106000 1.03000 1134.000000 15.918000 14665.000000 138.495000 927.000000 13.779000 6417.000000 84.415000 2.430530e+05 4.054337e+07 285.623500 1.642000 2.152790e+05 1.811500 0.040500 24.700000 9.093913e+07 4.748998e+07 4.155760e+07 3.335546e+06 4.347705e+05 4.510670e+05 87.745000 52.570000 36.935000 2.520000 3696.000000 1.679190e+05 0.112000 60.650000 8.390047e+07 237.016000 42.000000 18.517000 12.527000 38605.671000 1.200000 122.137000 6.180000 19.800000 33.100000 5.955000e+01 5.980000 81.330000 0.916000 5.249990e+04 6.900000 5.745000 707.315534
75% 7.906961e+06 3.357100e+04 35266.357250 149512.000000 535.500000 567.143000 64999.770000 209.744000 216.940500 1631.352000 3.464500 3.727000 1.22000 3489.750000 39.497000 28951.000000 307.959000 1584.000000 21.595000 12810.000000 144.512000 7.657615e+05 1.513048e+08 1030.061250 4.201500 5.807640e+05 3.735250 0.069800 61.050000 1.591653e+08 8.801805e+07 6.110446e+07 2.359063e+07 9.227180e+05 8.611230e+05 140.575000 73.237500 64.760000 22.517500 6395.000000 3.894540e+05 0.261000 71.760000 2.139934e+08 347.778000 46.600000 21.453000 15.957000 39753.244000 3.400000 177.961000 8.310000 23.400000 35.600000 5.955000e+01 8.050000 83.030000 0.926000 1.191370e+05 13.530000 14.987500 1517.147097
max 7.926573e+07 1.368167e+06 802518.429000 958437.000000 4529.000000 4190.000000 342095.547000 7453.161000 5436.723000 3047.832000 26.683000 18.308000 6.14000 28891.000000 104.105000 154536.000000 637.877000 4838.000000 71.757000 154696.000000 839.129000 3.740296e+06 8.206194e+08 6842.977000 32.956000 3.080396e+06 27.372000 0.491300 1867.100000 1.786849e+09 9.671539e+08 8.003048e+08 9.492562e+07 1.862727e+07 1.003800e+07 232.900000 87.460000 86.520000 62.670000 17164.000000 6.785334e+06 1.183000 100.000000 1.393409e+09 527.967000 48.200000 27.049000 18.493000 54225.446000 21.200000 431.297000 10.790000 30.100000 58.300000 5.955000e+01 13.050000 84.630000 0.947000 1.080748e+06 30.850000 107.250000 7406.847532

Another thing that can be quite useful is knowing how many times the values of a given variable repeat. How many times does each country in the dataset appear in the DataFrame?

df_covid['location'].value_counts()
location
Japan             774
South Korea       774
United States     774
France            772
Germany           769
India             766
Italy             765
Russia            765
United Kingdom    765
Brazil            739
Name: count, dtype: int64
df_covid['location'].value_counts(normalize=True,dropna=False)
location
Japan             0.101005
South Korea       0.101005
United States     0.101005
France            0.100744
Germany           0.100352
India             0.099961
Italy             0.099830
Russia            0.099830
United Kingdom    0.099830
Brazil            0.096437
Name: proportion, dtype: float64

Finally, we can sort the DataFrame by one column or a set of columns. Which countries had the highest number of new reported covid cases on March 1, 2022?

df_covid[df_covid['date']=='2022-03-01'].sort_values(by=['new_cases'], ascending=False)
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
6119 KOR Asia South Korea 2022-03-01 3492686.0 219237.0 166214.857 8266.0 96.0 94.143 68076.668 4273.194 3239.728 161.114 1.871 1.835 1.34 727.0 14.170 NaN NaN NaN NaN 12606.0 245.706 717980.0 62135661.0 1211.099 13.994 511822.0 9.976 0.3248 3.1 people tested 1.192321e+08 44862508.0 44372141.0 31512659.0 5367.0 119659.0 232.40 87.44 86.49 61.42 2332.0 4582.0 0.009 40.74 5.130518e+07 527.967 43.4 13.914 8.622 35938.374 0.2 85.998 6.80 6.2 40.9 NaN 12.27 83.03 0.916 NaN NaN NaN NaN
6884 GBR Europe United Kingdom 2022-03-01 19036570.0 99485.0 41712.714 161773.0 269.0 116.857 279099.479 1458.572 611.560 2371.791 3.944 1.713 1.05 278.0 4.076 10776.0 157.989 NaN NaN NaN NaN 673040.0 465243695.0 6821.044 9.868 653726.0 9.584 0.0638 15.7 tests performed 1.399278e+08 52640258.0 49029528.0 38258018.0 43493.0 48021.0 205.15 77.18 71.88 56.09 704.0 7281.0 0.011 NaN 6.820711e+07 272.898 40.8 18.517 12.527 39753.244 0.2 122.137 4.28 20.0 24.7 NaN 2.54 81.32 0.932 NaN NaN NaN NaN
5345 RUS Europe Russia 2022-03-01 16257688.0 96092.0 118164.000 345427.0 772.0 760.857 111421.169 658.561 809.830 2367.365 5.291 5.214 0.78 NaN NaN NaN NaN NaN NaN 81119.0 555.945 272654.0 276734652.0 1896.586 1.869 414545.0 2.841 0.2850 3.5 tests performed NaN NaN NaN NaN NaN 7425.0 NaN NaN NaN NaN 51.0 27479.0 0.019 NaN 1.459120e+08 8.823 39.6 14.178 9.393 24765.954 0.1 431.297 6.18 23.4 58.3 NaN 8.05 72.58 0.824 NaN NaN NaN NaN
1506 FRA Europe France 2022-03-01 22835320.0 79794.0 54468.000 138645.0 209.0 185.714 338692.415 1183.501 807.867 2056.376 3.100 2.755 0.67 2408.0 35.715 24437.0 362.448 940.0 13.942 7865.0 116.653 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.408727e+08 53961132.0 52317204.0 35672010.0 53652.0 63196.0 208.94 80.03 77.60 52.91 937.0 3135.0 0.005 NaN 6.742200e+07 122.578 42.0 19.718 13.079 38605.671 NaN 86.060 4.77 30.1 35.6 NaN 5.98 82.66 0.901 NaN NaN NaN NaN
4580 JPN Asia Japan 2022-03-01 5071249.0 65368.0 65592.143 23905.0 238.0 223.714 40231.789 518.585 520.363 189.646 1.888 1.775 0.97 NaN NaN NaN NaN NaN NaN NaN NaN 168484.0 37629310.0 298.525 1.337 161137.0 1.278 0.4071 2.5 people tested NaN NaN NaN NaN NaN 985076.0 NaN NaN NaN NaN 7815.0 12819.0 0.010 NaN 1.260508e+08 347.778 48.2 27.049 18.493 39002.223 NaN 79.370 5.72 11.2 33.7 NaN 13.05 84.63 0.919 NaN NaN NaN NaN
2275 DEU Europe Germany 2022-03-01 14974722.0 62096.0 135435.286 123024.0 75.0 159.429 178481.978 740.115 1614.237 1466.309 0.894 1.900 0.82 2288.0 27.270 NaN NaN 1525.0 18.176 7443.0 88.712 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.700861e+08 63472773.0 62753256.0 47462955.0 110964.0 100482.0 202.72 75.65 74.79 56.57 1198.0 8140.0 0.010 73.15 8.390047e+07 237.016 46.6 21.453 15.957 45229.245 NaN 156.139 8.31 28.2 33.1 NaN 8.00 81.33 0.947 NaN NaN NaN NaN
3806 ITA Europe Italy 2022-03-01 12829972.0 47136.0 39339.429 155000.0 233.0 212.571 212531.216 780.818 651.666 2567.608 3.860 3.521 0.76 708.0 11.728 11164.0 184.934 378.0 6.262 NaN NaN 530858.0 187867237.0 3112.061 8.794 412189.0 6.828 0.0954 10.5 tests performed 1.342220e+08 50627970.0 47502445.0 37598486.0 84125.0 108220.0 222.34 83.87 78.69 62.28 1793.0 5584.0 0.009 63.89 6.036747e+07 205.859 47.9 23.021 16.240 35220.084 2.0 113.151 4.78 19.8 27.8 NaN 3.18 83.51 0.892 NaN NaN NaN NaN
7658 USA North America United States 2022-03-01 79091361.0 47031.0 62331.429 952423.0 1691.0 1830.000 237572.183 141.270 187.229 2860.859 5.079 5.497 0.59 7314.0 21.970 36214.0 108.778 NaN NaN 32132.0 96.517 NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.544755e+08 253899142.0 215166280.0 94753028.0 241635.0 264689.0 167.01 76.47 64.81 28.54 797.0 69472.0 0.021 NaN 3.329151e+08 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN
734 BRA South America Brazil 2022-03-01 28818850.0 22279.0 65271.286 649922.0 246.0 598.143 134671.651 104.111 305.015 3037.112 1.150 2.795 0.48 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19430.0 0.091 NaN NaN tests performed 3.929256e+08 177836868.0 155071690.0 65073212.0 161992.0 814209.0 183.62 83.10 72.47 30.41 3805.0 173043.0 0.081 NaN 2.139934e+08 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 NaN NaN NaN NaN
3041 IND Asia India 2022-03-01 42938599.0 7554.0 10224.000 514246.0 223.0 232.000 30815.502 5.421 7.337 369.056 0.160 0.166 0.35 NaN NaN NaN NaN NaN NaN NaN NaN 901647.0 768382993.0 551.441 0.647 1021773.0 0.733 0.0100 99.9 samples tested 1.776674e+09 965923463.0 791803734.0 18947182.0 1495732.0 2450184.0 127.51 69.32 56.82 1.36 1758.0 343712.0 0.025 NaN 1.393409e+09 450.419 28.2 5.989 3.414 6426.674 21.2 282.280 10.39 1.9 20.6 59.55 0.53 69.66 0.645 NaN NaN NaN NaN

Note that in the code above we filtered the data based on a condition: the date being equal to March 1, 2022. Let's slow down now and walk through all the steps for indexing, selecting, and filtering DataFrames.
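As a preview, the filtering above can be broken into two steps: build a boolean Series from the comparison, then use it to index the DataFrame. A minimal sketch on a toy frame (the column names mimic the covid data; the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['Brazil', 'India', 'Brazil', 'India'],
    'date': ['2022-02-28', '2022-02-28', '2022-03-01', '2022-03-01'],
    'new_cases': [25000.0, 8000.0, 22279.0, 7554.0],
})

# Step 1: the comparison returns a boolean Series, one True/False per row
mask = df['date'] == '2022-03-01'
print(mask.tolist())   # [False, False, True, True]

# Step 2: indexing with the mask keeps only the rows where it is True
filtrado = df[mask].sort_values(by=['new_cases'], ascending=False)
print(filtrado['location'].tolist())   # ['Brazil', 'India']
```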

8.4.3 Indexing, selection, and filtering

We can select a particular range of rows of the DataFrame using the usual indexing logic.

df_covid[0:2]
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 BRA South America Brazil 2020-02-26 1.0 1.0 NaN NaN NaN NaN 0.005 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
1 BRA South America Brazil 2020-02-27 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN

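Um esboço mínimo, com um DataFrame fictício, ajuda a ver que essa fatia segue a mesma semântica posicional do fatiamento de listas:

```python
import pandas as pd

# DataFrame hipotético apenas para ilustrar o fatiamento posicional de linhas
df = pd.DataFrame({'pais': ['Brazil', 'USA', 'India'],
                   'casos': [1.0, 2.0, 3.0]})

# df[0:2] devolve as linhas nas posições 0 e 1, excluindo o limite final,
# exatamente como em uma lista
primeiras = df[0:2]
print(primeiras)
```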
Utilizando o indexador .iloc podemos selecionar linhas e colunas pelas suas posições numéricas. E se quisermos, por exemplo, comparar o número total de casos de todos os países ao longo do tempo, mantendo apenas as variáveis location, date e total_cases?

df_covid.iloc[:,2:5]
location date total_cases
0 Brazil 2020-02-26 1.0
1 Brazil 2020-02-27 1.0
2 Brazil 2020-02-28 1.0
3 Brazil 2020-02-29 2.0
4 Brazil 2020-03-01 2.0
... ... ... ...
7658 United States 2022-03-01 79091361.0
7659 United States 2022-03-02 79143716.0
7660 United States 2022-03-03 79196008.0
7661 United States 2022-03-04 79250509.0
7662 United States 2022-03-05 79265726.0

7663 rows × 3 columns

Podemos fazer exatamente a mesma coisa com o indexador .loc, combinando a seleção de todas as linhas (:) com os nomes das colunas que queremos manter.

df_covid.loc[:, ['location','date','total_cases']]
location date total_cases
0 Brazil 2020-02-26 1.0
1 Brazil 2020-02-27 1.0
2 Brazil 2020-02-28 1.0
3 Brazil 2020-02-29 2.0
4 Brazil 2020-03-01 2.0
... ... ... ...
7658 United States 2022-03-01 79091361.0
7659 United States 2022-03-02 79143716.0
7660 United States 2022-03-03 79196008.0
7661 United States 2022-03-04 79250509.0
7662 United States 2022-03-05 79265726.0

7663 rows × 3 columns
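A diferença essencial entre os dois indexadores é que .loc trabalha com rótulos e .iloc com posições. Um esboço mínimo, com dados fictícios, deixa isso explícito:

```python
import pandas as pd

# DataFrame hipotético com índice de rótulos para evidenciar a diferença
df = pd.DataFrame({'casos': [10, 20, 30]}, index=['a', 'b', 'c'])

por_rotulo = df.loc['a':'b', 'casos']   # .loc usa rótulos e INCLUI o limite final
por_posicao = df.iloc[0:2, 0]           # .iloc usa posições e EXCLUI o limite final

# as duas seleções são equivalentes neste caso
print(por_rotulo.equals(por_posicao))
```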

Uma terceira forma de fazer esse exercício de seleção de uma parte de um DataFrame já criado é através da criação de um novo DataFrame que mantenha apenas as colunas location, date e total_cases.

columns_to_keep = ['location','date','total_cases']
df_covid2 = df_covid[columns_to_keep].copy()  # .copy() evita o aviso SettingWithCopyWarning nas modificações seguintes

df_covid2.head(2)
location date total_cases
0 Brazil 2020-02-26 1.0
1 Brazil 2020-02-27 1.0

Podemos inclusive alterar o nome das colunas desse novo DataFrame de forma bem direta.

df_covid2.columns = ['localizacao','data','casos']
df_covid2.head()
localizacao data casos
0 Brazil 2020-02-26 1.0
1 Brazil 2020-02-27 1.0
2 Brazil 2020-02-28 1.0
3 Brazil 2020-02-29 2.0
4 Brazil 2020-03-01 2.0

É possível redefinir o índice desse novo DataFrame para ser igual ao país.

df_covid2.set_index('localizacao', inplace=True)
df_covid2.head(2)
data casos
localizacao
Brazil 2020-02-26 1.0
Brazil 2020-02-27 1.0
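Com o índice redefinido, o .loc passa a aceitar o rótulo diretamente. Um esboço com dados fictícios, assumindo a mesma estrutura de colunas:

```python
import pandas as pd

# DataFrame hipotético imitando a estrutura de df_covid2
df = pd.DataFrame({'localizacao': ['Brazil', 'Brazil', 'United States'],
                   'casos': [1.0, 2.0, 5.0]})
df = df.set_index('localizacao')

# .loc com o rótulo devolve todas as linhas cujo índice é 'Brazil'
apenas_brasil = df.loc['Brazil']
print(apenas_brasil)
```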

Por fim, podemos selecionar parte específica de um DataFrame utilizando testes booleanos sobre valores de linhas e colunas.

  • Selecionar apenas os dados brasileiros
df_covid[df_covid['location']=='Brazil']
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 BRA South America Brazil 2020-02-26 1.0 1.0 NaN NaN NaN NaN 0.005 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
1 BRA South America Brazil 2020-02-27 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
2 BRA South America Brazil 2020-02-28 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
3 BRA South America Brazil 2020-02-29 2.0 1.0 NaN NaN NaN NaN 0.009 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 7049.6 3.31 5.25 32.943066
4 BRA South America Brazil 2020-03-01 2.0 0.0 NaN NaN NaN NaN 0.009 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
734 BRA South America Brazil 2022-03-01 28818850.0 22279.0 65271.286 649922.0 246.0 598.143 134671.651 104.111 305.015 3037.112 1.150 2.795 0.48 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19430.0 0.091 NaN NaN tests performed 392925626.0 177836868.0 155071690.0 65073212.0 161992.0 814209.0 183.62 83.10 72.47 30.41 3805.0 173043.0 0.081 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
735 BRA South America Brazil 2022-03-02 28846495.0 27645.0 50451.286 650254.0 332.0 505.714 134800.837 129.186 235.761 3038.663 1.551 2.363 0.45 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19430.0 0.091 NaN NaN tests performed NaN NaN NaN NaN NaN 744548.0 NaN NaN NaN NaN 3479.0 141712.0 0.066 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
736 BRA South America Brazil 2022-03-03 28906672.0 60177.0 45348.143 650824.0 570.0 445.857 135082.047 281.210 211.914 3041.327 2.664 2.084 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 69951873.0 326.888 NaN 19430.0 0.091 NaN NaN tests performed 394111796.0 177987744.0 155380086.0 65802661.0 NaN 623346.0 184.17 83.17 72.61 30.75 2913.0 116179.0 0.054 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
737 BRA South America Brazil 2022-03-04 28978052.0 71380.0 42625.857 651522.0 698.0 432.286 135415.608 333.562 199.192 3044.589 3.262 2.020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 394812859.0 178113009.0 155569412.0 66190014.0 701063.0 508531.0 184.50 83.23 72.70 30.93 2376.0 94869.0 0.044 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
738 BRA South America Brazil 2022-03-05 29040800.0 62748.0 41606.857 652216.0 694.0 433.143 135708.832 293.224 194.431 3047.832 3.243 2.024 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 395464826.0 178203894.0 155711613.0 66609966.0 651967.0 525686.0 184.80 83.28 72.76 31.13 2457.0 94157.0 0.044 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN

739 rows × 67 columns

  • Selecionar os dados brasileiros apenas a partir do momento em que o número total de casos atingiu 10 milhões
df_covid[(df_covid['location']=='Brazil') & (df_covid['total_cases']>=10000000)]
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
358 BRA South America Brazil 2021-02-18 10039804.0 53479.0 45188.143 243752.0 1439.0 1032.714 46916.410 249.910 211.166 1139.063 6.725 4.826 1.02 NaN NaN NaN NaN NaN NaN NaN NaN 33829.0 33909730.0 158.462 0.158 66854.0 0.312 NaN NaN tests performed 6535156.0 5640483.0 894673.0 NaN 328476.0 262717.0 3.05 2.64 0.42 NaN 1228.0 150602.0 0.070 73.61 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
359 BRA South America Brazil 2021-02-19 10091961.0 52157.0 45647.429 245083.0 1331.0 1048.571 47160.142 243.732 213.312 1145.283 6.220 4.900 1.04 NaN NaN NaN NaN NaN NaN NaN NaN 108909.0 34018639.0 158.970 0.509 59659.0 0.279 NaN NaN tests performed 6813168.0 5756361.0 1056807.0 NaN 278012.0 241137.0 3.18 2.69 0.49 NaN 1127.0 115660.0 0.054 73.61 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
360 BRA South America Brazil 2021-02-20 10144332.0 52371.0 46721.000 246114.0 1031.0 1049.429 47404.873 244.732 218.329 1150.101 4.818 4.904 1.06 NaN NaN NaN NaN NaN NaN NaN NaN 128106.0 34146745.0 159.569 0.599 69117.0 0.323 NaN NaN tests performed 6950802.0 5817908.0 1132894.0 NaN 137634.0 244837.0 3.25 2.72 0.53 NaN 1144.0 110742.0 0.052 73.61 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
361 BRA South America Brazil 2021-02-21 10173817.0 29485.0 46544.143 246720.0 606.0 1042.714 47542.658 137.785 217.503 1152.933 2.832 4.873 1.07 NaN NaN NaN NaN NaN NaN NaN NaN 28496.0 34175241.0 159.702 0.133 66623.0 0.311 NaN NaN tests performed 7028356.0 5857080.0 1171276.0 NaN 77554.0 247768.0 3.28 2.74 0.55 NaN 1158.0 111567.0 0.052 73.61 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
362 BRA South America Brazil 2021-02-22 10205663.0 31846.0 47406.714 247431.0 711.0 1057.000 47691.476 148.818 221.533 1156.255 3.323 4.939 1.08 NaN NaN NaN NaN NaN NaN NaN NaN 9041.0 34184282.0 159.745 0.042 63283.0 0.296 NaN NaN tests performed 7297061.0 6002873.0 1294188.0 NaN 268705.0 241018.0 3.41 2.81 0.60 NaN 1126.0 96937.0 0.045 73.61 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
734 BRA South America Brazil 2022-03-01 28818850.0 22279.0 65271.286 649922.0 246.0 598.143 134671.651 104.111 305.015 3037.112 1.150 2.795 0.48 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19430.0 0.091 NaN NaN tests performed 392925626.0 177836868.0 155071690.0 65073212.0 161992.0 814209.0 183.62 83.10 72.47 30.41 3805.0 173043.0 0.081 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
735 BRA South America Brazil 2022-03-02 28846495.0 27645.0 50451.286 650254.0 332.0 505.714 134800.837 129.186 235.761 3038.663 1.551 2.363 0.45 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 19430.0 0.091 NaN NaN tests performed NaN NaN NaN NaN NaN 744548.0 NaN NaN NaN NaN 3479.0 141712.0 0.066 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
736 BRA South America Brazil 2022-03-03 28906672.0 60177.0 45348.143 650824.0 570.0 445.857 135082.047 281.210 211.914 3041.327 2.664 2.084 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 69951873.0 326.888 NaN 19430.0 0.091 NaN NaN tests performed 394111796.0 177987744.0 155380086.0 65802661.0 NaN 623346.0 184.17 83.17 72.61 30.75 2913.0 116179.0 0.054 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
737 BRA South America Brazil 2022-03-04 28978052.0 71380.0 42625.857 651522.0 698.0 432.286 135415.608 333.562 199.192 3044.589 3.262 2.020 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 394812859.0 178113009.0 155569412.0 66190014.0 701063.0 508531.0 184.50 83.23 72.70 30.93 2376.0 94869.0 0.044 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN
738 BRA South America Brazil 2022-03-05 29040800.0 62748.0 41606.857 652216.0 694.0 433.143 135708.832 293.224 194.431 3047.832 3.243 2.024 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 395464826.0 178203894.0 155711613.0 66609966.0 651967.0 525686.0 184.80 83.28 72.76 31.13 2457.0 94157.0 0.044 NaN 213993441.0 25.04 33.5 8.552 5.06 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.2 75.88 0.765 NaN NaN NaN NaN

381 rows × 67 columns

  • Selecionar os dados do Brasil e Estados Unidos
df_covid[df_covid['iso_code'].isin(['BRA','USA'])]
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 BRA South America Brazil 2020-02-26 1.0 1.0 NaN NaN NaN NaN 0.005 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 NaN NaN NaN NaN
1 BRA South America Brazil 2020-02-27 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 NaN NaN NaN NaN
2 BRA South America Brazil 2020-02-28 1.0 0.0 NaN NaN NaN NaN 0.005 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 NaN NaN NaN NaN
3 BRA South America Brazil 2020-02-29 2.0 1.0 NaN NaN NaN NaN 0.009 0.005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 7049.6 3.31 5.25 32.943066
4 BRA South America Brazil 2020-03-01 2.0 0.0 NaN NaN NaN NaN 0.009 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.56 213993441.0 25.040 33.5 8.552 5.060 14103.452 3.4 177.961 8.11 10.1 17.9 NaN 2.20 75.88 0.765 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7658 USA North America United States 2022-03-01 79091361.0 47031.0 62331.429 952423.0 1691.0 1830.000 237572.183 141.270 187.229 2860.859 5.079 5.497 0.59 7314.0 21.970 36214.0 108.778 NaN NaN 32132.0 96.517 NaN NaN NaN NaN NaN NaN NaN NaN NaN 554475530.0 253899142.0 215166280.0 94753028.0 241635.0 264689.0 167.01 76.47 64.81 28.54 797.0 69472.0 0.021 NaN 332915074.0 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN
7659 USA North America United States 2022-03-02 79143716.0 52355.0 57824.857 954518.0 2095.0 1725.000 237729.446 157.262 173.693 2867.152 6.293 5.182 0.59 6993.0 21.005 34699.0 104.228 NaN NaN 30821.0 92.579 NaN NaN NaN NaN NaN NaN NaN NaN NaN 554692730.0 253956787.0 215226511.0 94847407.0 217200.0 250409.0 167.07 76.49 64.83 28.57 754.0 65798.0 0.020 NaN 332915074.0 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN
7660 USA North America United States 2022-03-03 79196008.0 52292.0 55673.857 956261.0 1743.0 1553.714 237886.519 157.073 167.231 2872.387 5.236 4.667 NaN 7015.0 21.071 34108.0 102.453 NaN NaN 29613.0 88.951 NaN NaN NaN NaN NaN NaN NaN NaN NaN 554842009.0 253995527.0 215266826.0 94914378.0 149279.0 228576.0 167.12 76.50 64.84 28.59 688.0 60181.0 0.018 NaN 332915074.0 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN
7661 USA North America United States 2022-03-04 79250509.0 54501.0 52764.286 958144.0 1883.0 1497.857 238050.227 163.708 158.492 2878.043 5.656 4.499 NaN 6203.0 18.632 30560.0 91.795 NaN NaN 28778.0 86.442 NaN NaN NaN NaN NaN NaN NaN NaN NaN 554868239.0 254002347.0 215274110.0 94925621.0 26230.0 180572.0 167.12 76.50 64.84 28.59 544.0 48110.0 0.014 NaN 332915074.0 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN
7662 USA North America United States 2022-03-05 79265726.0 15217.0 47989.857 958437.0 293.0 1425.714 238095.936 45.708 144.150 2878.923 0.880 4.283 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 554868239.0 254002347.0 215274110.0 94925621.0 0.0 148251.0 167.12 76.50 64.84 28.59 447.0 39094.0 0.012 NaN 332915074.0 35.608 38.3 15.413 9.732 54225.446 1.2 151.089 10.79 19.1 24.6 NaN 2.77 78.86 0.926 NaN NaN NaN NaN

1513 rows × 67 columns
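Esses filtros booleanos também podem ser escritos com o método query, que recebe a condição como uma string. Um esboço mínimo com dados fictícios, apenas para comparar as duas sintaxes:

```python
import pandas as pd

# DataFrame hipotético para comparar a máscara booleana e o método query
df = pd.DataFrame({'location': ['Brazil', 'Brazil', 'United States'],
                   'total_cases': [5.0, 15.0, 20.0]})

via_mascara = df[(df['location'] == 'Brazil') & (df['total_cases'] >= 10)]
via_query = df.query("location == 'Brazil' and total_cases >= 10")

# os dois caminhos produzem exatamente o mesmo resultado
print(via_mascara.equals(via_query))
```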

8.4.4 Operações com colunas numéricas

Criar novas colunas a partir de operações sobre aquelas que já existem é bastante simples. Para visualizar melhor os resultados, vamos manter apenas os dados do Brasil.

df_covid_br = df_covid[df_covid['iso_code']=='BRA'].reset_index()

Agora já podemos brincar com as colunas numéricas e mostrar o resultado dos cálculos.

  • Coluna com a proporção de novos casos em relação ao total acumulado a cada dia
df_covid_br['new_cases_percentage'] = df_covid_br['new_cases'] / df_covid_br['total_cases']

columns_to_show = ['date','total_cases','new_cases','new_cases_percentage']
df_covid_br[columns_to_show]
date total_cases new_cases new_cases_percentage
0 2020-02-26 1.0 1.0 1.000000
1 2020-02-27 1.0 0.0 0.000000
2 2020-02-28 1.0 0.0 0.000000
3 2020-02-29 2.0 1.0 0.500000
4 2020-03-01 2.0 0.0 0.000000
... ... ... ... ...
734 2022-03-01 28818850.0 22279.0 0.000773
735 2022-03-02 28846495.0 27645.0 0.000958
736 2022-03-03 28906672.0 60177.0 0.002082
737 2022-03-04 28978052.0 71380.0 0.002463
738 2022-03-05 29040800.0 62748.0 0.002161

739 rows × 4 columns

  • Coluna com o log do total de casos
import numpy as np
df_covid_br['log_total_cases'] = np.log(df_covid_br['total_cases'])

columns_to_show = ['date','total_cases','log_total_cases']
df_covid_br[columns_to_show]
date total_cases log_total_cases
0 2020-02-26 1.0 0.000000
1 2020-02-27 1.0 0.000000
2 2020-02-28 1.0 0.000000
3 2020-02-29 2.0 0.693147
4 2020-03-01 2.0 0.693147
... ... ... ...
734 2022-03-01 28818850.0 17.176540
735 2022-03-02 28846495.0 17.177499
736 2022-03-03 28906672.0 17.179583
737 2022-03-04 28978052.0 17.182049
738 2022-03-05 29040800.0 17.184212

739 rows × 3 columns

  • Coluna com o total de casos em milhares
df_covid_br['total_cases_th'] = df_covid_br['total_cases'] / 1000

columns_to_show = ['date','total_cases','total_cases_th']
df_covid_br[columns_to_show]
date total_cases total_cases_th
0 2020-02-26 1.0 0.001
1 2020-02-27 1.0 0.001
2 2020-02-28 1.0 0.001
3 2020-02-29 2.0 0.002
4 2020-03-01 2.0 0.002
... ... ... ...
734 2022-03-01 28818850.0 28818.850
735 2022-03-02 28846495.0 28846.495
736 2022-03-03 28906672.0 28906.672
737 2022-03-04 28978052.0 28978.052
738 2022-03-05 29040800.0 29040.800

739 rows × 3 columns

  • Subtrair a média do número de novos casos da coluna de novos casos para encontrar o desvio em relação à média.
df_covid_br['new_cases_demean'] = df_covid_br['new_cases'] - df_covid_br['new_cases'].mean()

columns_to_show = ['date','new_cases','new_cases_demean']
df_covid_br[columns_to_show]
date new_cases new_cases_demean
0 2020-02-26 1.0 -39260.489824
1 2020-02-27 0.0 -39261.489824
2 2020-02-28 0.0 -39261.489824
3 2020-02-29 1.0 -39260.489824
4 2020-03-01 0.0 -39261.489824
... ... ... ...
734 2022-03-01 22279.0 -16982.489824
735 2022-03-02 27645.0 -11616.489824
736 2022-03-03 60177.0 20915.510176
737 2022-03-04 71380.0 32118.510176
738 2022-03-05 62748.0 23486.510176

739 rows × 3 columns

Para realizar outras operações simples de soma, subtração, multiplicação e divisão a lógica é a mesma, independente de ser uma operação entre colunas ou de uma coluna com um escalar. Para operações mais complexas, no entanto, o método .apply em conjunto com funções anônimas do tipo lambda constituem uma ferramenta poderosíssima.
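Para ver o que o apply permite além das operações aritméticas simples, segue um esboço mínimo com dados fictícios: com axis=1 a função lambda recebe a linha inteira, o que permite combinar várias colunas e tratar casos especiais (aqui, evitar divisão por zero).

```python
import pandas as pd

# DataFrame hipotético: novos casos e total acumulado
df = pd.DataFrame({'novos': [10.0, 0.0], 'total': [100.0, 100.0]})

# apply com axis=1 avalia a lambda linha a linha, recebendo a linha completa
df['fracao'] = df.apply(
    lambda row: row['novos'] / row['total'] if row['total'] > 0 else 0.0,
    axis=1)
print(df['fracao'].tolist())
```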

  • Elevar o número total de casos ao quadrado e dividir o resultado por 1 milhão.
df_covid_br['total_cases_sq'] = df_covid_br['total_cases'].apply(lambda x: (x**2)/ 1e6)

columns_to_show = ['date','total_cases','total_cases_sq']
df_covid_br[columns_to_show]
date total_cases total_cases_sq
0 2020-02-26 1.0 1.000000e-06
1 2020-02-27 1.0 1.000000e-06
2 2020-02-28 1.0 1.000000e-06
3 2020-02-29 2.0 4.000000e-06
4 2020-03-01 2.0 4.000000e-06
... ... ... ...
734 2022-03-01 28818850.0 8.305261e+08
735 2022-03-02 28846495.0 8.321203e+08
736 2022-03-03 28906672.0 8.355957e+08
737 2022-03-04 28978052.0 8.397275e+08
738 2022-03-05 29040800.0 8.433681e+08

739 rows × 3 columns

Para visualizar melhor, vamos focar no último valor, representado pelo índice 738.

print(df_covid_br.loc[738,'total_cases'])
print(df_covid_br.loc[738,'total_cases_sq'])
print((df_covid_br.loc[738,'total_cases'] ** 2) / 1000000)
29040800.0
843368064.64
843368064.64

Dei apenas exemplos super simples, mas note a força que o método apply tem quando trabalhado em conjunto com as funções lambda. O céu é o limite.

8.4.5 Operações com colunas contendo strings

O Pandas também nos fornece métodos bastante interessantes para trabalhar com dados do tipo string. Vamos começar com um exemplo simples: transformar a coluna com o nome dos países em uma coluna apenas com letras minúsculas. Para utilizar esses métodos precisamos primeiro, por meio do acessador .str, dizer ao Pandas que a coluna de interesse receberá operações específicas de strings.

df_covid_br['lower_location'] = df_covid_br['location'].str.lower()

columns_to_show = ['iso_code','continent','location','lower_location','date']
df_covid_br[columns_to_show]
iso_code continent location lower_location date
0 BRA South America Brazil brazil 2020-02-26
1 BRA South America Brazil brazil 2020-02-27
2 BRA South America Brazil brazil 2020-02-28
3 BRA South America Brazil brazil 2020-02-29
4 BRA South America Brazil brazil 2020-03-01
... ... ... ... ... ...
734 BRA South America Brazil brazil 2022-03-01
735 BRA South America Brazil brazil 2022-03-02
736 BRA South America Brazil brazil 2022-03-03
737 BRA South America Brazil brazil 2022-03-04
738 BRA South America Brazil brazil 2022-03-05

739 rows × 5 columns

Podemos aplicar a maior parte dos métodos de strings que já vimos (replace, startswith, strip, extract etc.) utilizando essa sintaxe. Outra operação interessante, útil por exemplo para separar em colunas distintas o primeiro e o último nome de uma pessoa, é realizada através do método split. Vamos tentar fazer isso com a coluna do continente.

df_covid_br[['continent_1', 'continent_2']] = df_covid_br['continent'].str.split(pat=' ',n=1, expand=True, regex=False)

columns_to_show = ['iso_code','continent','continent_1','continent_2','location','lower_location','date']
df_covid_br[columns_to_show]
iso_code continent continent_1 continent_2 location lower_location date
0 BRA South America South America Brazil brazil 2020-02-26
1 BRA South America South America Brazil brazil 2020-02-27
2 BRA South America South America Brazil brazil 2020-02-28
3 BRA South America South America Brazil brazil 2020-02-29
4 BRA South America South America Brazil brazil 2020-03-01
... ... ... ... ... ... ... ...
734 BRA South America South America Brazil brazil 2022-03-01
735 BRA South America South America Brazil brazil 2022-03-02
736 BRA South America South America Brazil brazil 2022-03-03
737 BRA South America South America Brazil brazil 2022-03-04
738 BRA South America South America Brazil brazil 2022-03-05

739 rows × 7 columns

Vamos olhar com calma esse resultado e os argumentos que utilizamos dentro do método split.

  • pat: determina os caracteres que serão utilizados para fazer a divisão do string.
  • n: limita o número de divisões a serem feitas no string.
  • expand: caso o valor desse argumento seja igual a True, a divisão do string irá gerar n+1 novas colunas no dataframe.
  • regex: determina se o padrão pat, utilizado para a divisão do string, deve ser entendido como uma expressão regular ou não.

Bacana né? Esse método split nos permite fazer muita coisa legal com colunas de texto!
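Para fixar a ideia do acessador .str, segue um esboço mínimo com uma série fictícia de nomes de países, aplicando alguns dos outros métodos citados acima:

```python
import pandas as pd

# Série hipotética apenas para ilustrar o acessador .str
paises = pd.Series(['Brazil', 'United States', 'India'])

print(paises.str.upper().tolist())           # tudo em maiúsculas
print(paises.str.startswith('U').tolist())   # teste booleano elemento a elemento
print(paises.str.replace(' ', '_').tolist()) # substituição de caracteres
```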

Por fim, uma outra funcionalidade interessante é a possibilidade de criação de novas colunas de strings utilizando dicionários. Podemos, por exemplo, utilizar essa funcionalidade para criar uma nova coluna que mostre se o número de novos casos por milhão está no quartil superior naquele dia ou não.

df_covid_br['new_cases_per_million'].describe()
count     737.000000
mean      183.470531
std       173.332413
min         0.000000
25%        62.684000
50%       148.126000
75%       254.279000
max      1341.859000
Name: new_cases_per_million, dtype: float64
# criando coluna a partir de list comprehension
df_covid_br['quartil_superior_bool'] = [elem >= 254.279 for elem in df_covid_br['new_cases_per_million']]


# criando coluna a partir de um dicionario
quartil_superior = {
    True:'Acima do 3º quartil de novos casos por milhão',
    False:'Abaixo'
}

df_covid_br['quartil_superior_cat'] = df_covid_br['quartil_superior_bool'].apply(lambda x: quartil_superior[x])

E qual é o resultado final?

df_covid_br[['date','new_cases_per_million','quartil_superior_bool','quartil_superior_cat']]
date new_cases_per_million quartil_superior_bool quartil_superior_cat
0 2020-02-26 0.005 False Abaixo
1 2020-02-27 0.000 False Abaixo
2 2020-02-28 0.000 False Abaixo
3 2020-02-29 0.005 False Abaixo
4 2020-03-01 0.000 False Abaixo
... ... ... ... ...
734 2022-03-01 104.111 False Abaixo
735 2022-03-02 129.186 False Abaixo
736 2022-03-03 281.210 True Acima do 3º quartil de novos casos por milhão
737 2022-03-04 333.562 True Acima do 3º quartil de novos casos por milhão
738 2022-03-05 293.224 True Acima do 3º quartil de novos casos por milhão

739 rows × 4 columns
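Uma alternativa, que é um esboço e não o método usado acima: calcular o quartil com .quantile(0.75) em vez de copiar o valor 254.279 à mão, e criar a coluna categórica de uma vez com np.where. Abaixo, com uma série fictícia:

```python
import numpy as np
import pandas as pd

# Série hipotética de novos casos por milhão
s = pd.Series([10.0, 100.0, 300.0, 500.0])

# o 3º quartil vem direto dos dados, sem valor "chumbado" no código
q3 = s.quantile(0.75)

# np.where escolhe o rótulo elemento a elemento, sem precisar de dicionário
categorias = np.where(s >= q3, 'Acima do 3º quartil', 'Abaixo')
print(list(categorias))
```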

8.4.6 Agrupamento e junção de DataFrames

Suponha que queiramos calcular médias mensais de novos casos e novos casos por milhão e salvar isso em um novo dataframe. Para essa operação podemos utilizar o famoso groupby. Vamos fazer isso para os dados brasileiros e norte-americanos de forma separada.

df_covid['ano_mes'] = df_covid['date'].str[0:7]
columns_to_keep = ['location','ano_mes','new_cases','new_cases_per_million']

df_covid_br = df_covid[df_covid['iso_code']=='BRA'][columns_to_keep]
df_covid_us = df_covid[df_covid['iso_code']=='USA'][columns_to_keep]

df_covid_br.head(2)
location ano_mes new_cases new_cases_per_million
0 Brazil 2020-02 1.0 0.005
1 Brazil 2020-02 0.0 0.000
df_covid_br_mensal = df_covid_br.groupby(by=['location','ano_mes'], as_index=False).mean()
df_covid_us_mensal = df_covid_us.groupby(by=['location','ano_mes'], as_index=False).mean()

df_covid_br_mensal
location ano_mes new_cases new_cases_per_million
0 Brazil 2020-02 0.500000 0.002500
1 Brazil 2020-03 184.354839 0.861452
2 Brazil 2020-04 2715.666667 12.690400
3 Brazil 2020-05 13837.096774 64.661387
4 Brazil 2020-06 31597.517241 147.656517
5 Brazil 2020-07 40573.612903 189.602194
6 Brazil 2020-08 40141.225806 187.581516
7 Brazil 2020-09 30084.533333 140.586233
8 Brazil 2020-10 23238.645161 108.595065
9 Brazil 2020-11 26806.733333 125.268900
10 Brazil 2020-12 43195.709677 201.855258
11 Brazil 2021-01 49273.838710 230.258677
12 Brazil 2021-02 48198.928571 225.235607
13 Brazil 2021-03 71138.741935 332.434258
14 Brazil 2021-04 63723.733333 297.783600
15 Brazil 2021-05 60734.967742 283.817000
16 Brazil 2021-06 67080.266667 313.468767
17 Brazil 2021-07 43608.064516 203.782323
18 Brazil 2021-08 27840.322581 130.098935
19 Brazil 2021-09 18009.931034 84.161034
20 Brazil 2021-10 12331.064516 57.623548
21 Brazil 2021-11 9517.266667 44.474533
22 Brazil 2021-12 6183.741935 28.896806
23 Brazil 2022-01 102312.612903 478.111000
24 Brazil 2022-02 119037.178571 556.265607
25 Brazil 2022-03 48845.800000 228.258600

Note que fizemos um dataframe para o Brasil e outro para os Estados Unidos. Podíamos ter feito tudo junto? Podíamos. Mas fizemos dessa forma justamente para mostrar outra funcionalidade bastante importante quando trabalhamos com bases de dados: a junção de duas bases diferentes. Essa junção pode ser realizada com duas funções: concat e merge.

The concat function "stacks" two datasets, creating a new one that simply appends the rows of the different dataframes. This is very useful, for example, when we generate dataframes inside a loop and want to keep appending each new one to a single large dataframe.

df_covid_concat = pd.concat(objs=[df_covid_br_mensal, df_covid_us_mensal]).reset_index(drop=True)

df_covid_concat
location ano_mes new_cases new_cases_per_million
0 Brazil 2020-02 0.500000 0.002500
1 Brazil 2020-03 184.354839 0.861452
2 Brazil 2020-04 2715.666667 12.690400
3 Brazil 2020-05 13837.096774 64.661387
4 Brazil 2020-06 31597.517241 147.656517
5 Brazil 2020-07 40573.612903 189.602194
6 Brazil 2020-08 40141.225806 187.581516
7 Brazil 2020-09 30084.533333 140.586233
8 Brazil 2020-10 23238.645161 108.595065
9 Brazil 2020-11 26806.733333 125.268900
10 Brazil 2020-12 43195.709677 201.855258
11 Brazil 2021-01 49273.838710 230.258677
12 Brazil 2021-02 48198.928571 225.235607
13 Brazil 2021-03 71138.741935 332.434258
14 Brazil 2021-04 63723.733333 297.783600
15 Brazil 2021-05 60734.967742 283.817000
16 Brazil 2021-06 67080.266667 313.468767
17 Brazil 2021-07 43608.064516 203.782323
18 Brazil 2021-08 27840.322581 130.098935
19 Brazil 2021-09 18009.931034 84.161034
20 Brazil 2021-10 12331.064516 57.623548
21 Brazil 2021-11 9517.266667 44.474533
22 Brazil 2021-12 6183.741935 28.896806
23 Brazil 2022-01 102312.612903 478.111000
24 Brazil 2022-02 119037.178571 556.265607
25 Brazil 2022-03 48845.800000 228.258600
26 United States 2020-01 0.777778 0.002333
27 United States 2020-02 0.586207 0.001759
28 United States 2020-03 6195.290323 18.609226
29 United States 2020-04 29444.100000 88.443333
30 United States 2020-05 22969.967742 68.996452
31 United States 2020-06 28578.766667 85.844033
32 United States 2020-07 61294.612903 184.114871
33 United States 2020-08 48365.322581 145.278355
34 United States 2020-09 39903.133333 119.859833
35 United States 2020-10 61870.806452 185.845677
36 United States 2020-11 149008.733333 447.587733
37 United States 2020-12 211725.580645 635.974742
38 United States 2021-01 198457.129032 596.119355
39 United States 2021-02 86034.107143 258.426643
40 United States 2021-03 58529.064516 175.807742
41 United States 2021-04 62970.333333 189.148333
42 United States 2021-05 29748.451613 89.357484
43 United States 2021-06 13276.700000 39.880100
44 United States 2021-07 42530.483871 127.751742
45 United States 2021-08 138009.000000 414.547258
46 United States 2021-09 138317.100000 415.472700
47 United States 2021-10 80986.354839 243.264323
48 United States 2021-11 84910.433333 255.051300
49 United States 2021-12 200890.870968 603.429839
50 United States 2022-01 650778.903226 1954.789548
51 United States 2022-02 145005.857143 435.564107
52 United States 2022-03 44279.200000 133.004200

The merge function, on the other hand, joins two datasets according to identifier columns. Let's join the Brazil and US datasets using the year-month column as the identifier.

df_covid_merge = pd.merge(df_covid_br_mensal, df_covid_us_mensal, how='inner', on=['ano_mes'])

df_covid_merge
location_x ano_mes new_cases_x new_cases_per_million_x location_y new_cases_y new_cases_per_million_y
0 Brazil 2020-02 0.500000 0.002500 United States 0.586207 0.001759
1 Brazil 2020-03 184.354839 0.861452 United States 6195.290323 18.609226
2 Brazil 2020-04 2715.666667 12.690400 United States 29444.100000 88.443333
3 Brazil 2020-05 13837.096774 64.661387 United States 22969.967742 68.996452
4 Brazil 2020-06 31597.517241 147.656517 United States 28578.766667 85.844033
5 Brazil 2020-07 40573.612903 189.602194 United States 61294.612903 184.114871
6 Brazil 2020-08 40141.225806 187.581516 United States 48365.322581 145.278355
7 Brazil 2020-09 30084.533333 140.586233 United States 39903.133333 119.859833
8 Brazil 2020-10 23238.645161 108.595065 United States 61870.806452 185.845677
9 Brazil 2020-11 26806.733333 125.268900 United States 149008.733333 447.587733
10 Brazil 2020-12 43195.709677 201.855258 United States 211725.580645 635.974742
11 Brazil 2021-01 49273.838710 230.258677 United States 198457.129032 596.119355
12 Brazil 2021-02 48198.928571 225.235607 United States 86034.107143 258.426643
13 Brazil 2021-03 71138.741935 332.434258 United States 58529.064516 175.807742
14 Brazil 2021-04 63723.733333 297.783600 United States 62970.333333 189.148333
15 Brazil 2021-05 60734.967742 283.817000 United States 29748.451613 89.357484
16 Brazil 2021-06 67080.266667 313.468767 United States 13276.700000 39.880100
17 Brazil 2021-07 43608.064516 203.782323 United States 42530.483871 127.751742
18 Brazil 2021-08 27840.322581 130.098935 United States 138009.000000 414.547258
19 Brazil 2021-09 18009.931034 84.161034 United States 138317.100000 415.472700
20 Brazil 2021-10 12331.064516 57.623548 United States 80986.354839 243.264323
21 Brazil 2021-11 9517.266667 44.474533 United States 84910.433333 255.051300
22 Brazil 2021-12 6183.741935 28.896806 United States 200890.870968 603.429839
23 Brazil 2022-01 102312.612903 478.111000 United States 650778.903226 1954.789548
24 Brazil 2022-02 119037.178571 556.265607 United States 145005.857143 435.564107
25 Brazil 2022-03 48845.800000 228.258600 United States 44279.200000 133.004200

Nice! Take some time at home to explore the particularities of this merge function; there are many.
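To give a taste of those particularities, here is a minimal sketch, on two tiny hypothetical frames, of two merge options beyond how='inner': an outer join, which keeps the union of keys, and indicator=True, which records where each row came from.

```python
import pandas as pd

# Two tiny toy frames with partially overlapping keys (hypothetical values)
left = pd.DataFrame({'ano_mes': ['2020-02', '2020-03'], 'nc_br': [0.5, 184.4]})
right = pd.DataFrame({'ano_mes': ['2020-03', '2020-04'], 'nc_us': [6195.3, 29444.1]})

# how='inner' keeps only the keys present in both frames
inner = pd.merge(left, right, how='inner', on='ano_mes')

# how='outer' keeps the union of keys; indicator=True adds a _merge column
# saying whether each row came from the left frame, the right frame, or both
outer = pd.merge(left, right, how='outer', on='ano_mes', indicator=True)

print(inner)
print(outer)
```

Had we used how='outer' above, months present for only one of the two countries would survive in the result, with NaN in the other country's columns.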

But note that both datasets had columns with the same names. By default, pandas renames those columns with the suffixes _x and _y. Let's rename them using a dictionary, an alternative to what we did earlier.

df_covid_merge.drop(columns=['location_x','location_y'], inplace=True)

dict_rename = {
    'new_cases_x':'nc_br',
    'new_cases_per_million_x':'ncpm_br',
    'new_cases_y':'nc_us',
    'new_cases_per_million_y':'ncpm_us'
}

df_covid_merge.rename(columns=dict_rename, inplace=True)

df_covid_merge
ano_mes nc_br ncpm_br nc_us ncpm_us
0 2020-02 0.500000 0.002500 0.586207 0.001759
1 2020-03 184.354839 0.861452 6195.290323 18.609226
2 2020-04 2715.666667 12.690400 29444.100000 88.443333
3 2020-05 13837.096774 64.661387 22969.967742 68.996452
4 2020-06 31597.517241 147.656517 28578.766667 85.844033
5 2020-07 40573.612903 189.602194 61294.612903 184.114871
6 2020-08 40141.225806 187.581516 48365.322581 145.278355
7 2020-09 30084.533333 140.586233 39903.133333 119.859833
8 2020-10 23238.645161 108.595065 61870.806452 185.845677
9 2020-11 26806.733333 125.268900 149008.733333 447.587733
10 2020-12 43195.709677 201.855258 211725.580645 635.974742
11 2021-01 49273.838710 230.258677 198457.129032 596.119355
12 2021-02 48198.928571 225.235607 86034.107143 258.426643
13 2021-03 71138.741935 332.434258 58529.064516 175.807742
14 2021-04 63723.733333 297.783600 62970.333333 189.148333
15 2021-05 60734.967742 283.817000 29748.451613 89.357484
16 2021-06 67080.266667 313.468767 13276.700000 39.880100
17 2021-07 43608.064516 203.782323 42530.483871 127.751742
18 2021-08 27840.322581 130.098935 138009.000000 414.547258
19 2021-09 18009.931034 84.161034 138317.100000 415.472700
20 2021-10 12331.064516 57.623548 80986.354839 243.264323
21 2021-11 9517.266667 44.474533 84910.433333 255.051300
22 2021-12 6183.741935 28.896806 200890.870968 603.429839
23 2022-01 102312.612903 478.111000 650778.903226 1954.789548
24 2022-02 119037.178571 556.265607 145005.857143 435.564107
25 2022-03 48845.800000 228.258600 44279.200000 133.004200

8.4.7 Exporting the DataFrame

Finally, the last thing we would like to learn in this extended Pandas lesson is how to save the dataframe resulting from all our operations to a new csv or a file of any other type. Just as we had functions like read_csv to bring in external files, we have methods like to_csv to save a given dataframe to a new file that can be used in other software.

The syntax is quite simple:

df_covid_merge.to_csv('df_resultado_aula_pandas.csv', sep=',', encoding='utf8', index=False)

os.listdir()
['df_resultado_aula_pandas.csv',
 'EAE1106 - Aula Fundamentos.html',
 'EAE1106 - Aula Fundamentos.pdf',
 'EAE1106 - Aula Funções.html',
 'EAE1106 - Aula Funções.pdf',
 'EAE1106 - Aula Instalação.html',
 'EAE1106 - Aula Instalação.pdf',
 'EAE1106 - Aula Iteração.html',
 'EAE1106 - Aula Iteração.pdf',
 'EAE1106 - Aula Matplotlib.html',
 'EAE1106 - Aula Matplotlib.pdf',
 'EAE1106 - Aula NumPy.html',
 'EAE1106 - Aula NumPy.pdf',
 'EAE1106 - Aula Objetos.html',
 'EAE1106 - Aula Objetos.pdf',
 'EAE1106 - Aula Pandas.html',
 'EAE1106 - Aula Pandas.pdf',
 'EAE1106_Fundamentos_slides.aux',
 'EAE1106_Fundamentos_slides.log',
 'EAE1106_Fundamentos_slides.nav',
 'EAE1106_Fundamentos_slides.out',
 'EAE1106_Fundamentos_slides.pdf',
 'EAE1106_Fundamentos_slides.snm',
 'EAE1106_Fundamentos_slides.synctex.gz',
 'EAE1106_Fundamentos_slides.tex',
 'EAE1106_Fundamentos_slides.toc',
 'EAE1106_Introducao_slides.aux',
 'EAE1106_Introducao_slides.log',
 'EAE1106_Introducao_slides.nav',
 'EAE1106_Introducao_slides.out',
 'EAE1106_Introducao_slides.pdf',
 'EAE1106_Introducao_slides.snm',
 'EAE1106_Introducao_slides.synctex.gz',
 'EAE1106_Introducao_slides.tex',
 'EAE1106_Introducao_slides.toc',
 'notebook_images',
 'owid-covid-data_top10.csv']

Look at the file we just generated, folks!

Try exporting the dataframe in other formats and importing it into other software. Remember: just as when learning new languages such as English, German, or Mandarin, fluency in programming comes only with lots of practice, trial, and error.
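As a starting point, here is a minimal sketch of two other exporters on a small hypothetical frame: to_csv with a tab separator and to_json, reading both files back to confirm the round trip. (Other exporters such as to_excel and to_parquet exist but require extra packages like openpyxl or pyarrow.)

```python
import pandas as pd

# A small frame just to illustrate other to_* exporters (hypothetical values)
df_exemplo = pd.DataFrame({'ano_mes': ['2020-02', '2020-03'],
                           'nc_br': [0.5, 184.4]})

# Tab-separated text, which spreadsheet software imports easily
df_exemplo.to_csv('exemplo.tsv', sep='\t', index=False)

# JSON with one record per row, handy for web applications
df_exemplo.to_json('exemplo.json', orient='records')

# Reading both back to check that the data survived the round trip
df_tsv = pd.read_csv('exemplo.tsv', sep='\t')
df_json = pd.read_json('exemplo.json', orient='records')

print(df_tsv)
print(df_json)
```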

8.5 Efficiency and dtypes

DataFrames can be quite heavy objects, with millions of rows and thousands of columns. This can sometimes be a real problem for us. There are libraries designed specifically to work efficiently with very large datasets (e.g., Dask), but some functionality within Pandas itself can already help a great deal in most situations.

Let's start by re-reading the full covid dataset with the 10 countries with the highest number of cases. Also, imagine that, through carelessness, we replicated every entry 10 times over. How large is this dataset, and how many bytes of memory were allocated for this object?

df_covid_duplicada = pd.DataFrame()

for i in range(0,10):
    df_covid_duplicada = pd.concat([df_covid_duplicada, df_covid]).reset_index(drop=True)

df_covid_duplicada.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76630 entries, 0 to 76629
Data columns (total 68 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    76630 non-null  object 
 1   continent                                   76630 non-null  object 
 2   location                                    76630 non-null  object 
 3   date                                        76630 non-null  object 
 4   total_cases                                 76630 non-null  float64
 5   new_cases                                   76410 non-null  float64
 6   new_cases_smoothed                          74840 non-null  float64
 7   total_deaths                                73450 non-null  float64
 8   new_deaths                                  73270 non-null  float64
 9   new_deaths_smoothed                         71730 non-null  float64
 10  total_cases_per_million                     76630 non-null  float64
 11  new_cases_per_million                       76410 non-null  float64
 12  new_cases_smoothed_per_million              74840 non-null  float64
 13  total_deaths_per_million                    73450 non-null  float64
 14  new_deaths_per_million                      73270 non-null  float64
 15  new_deaths_smoothed_per_million             71730 non-null  float64
 16  reproduction_rate                           72920 non-null  float64
 17  icu_patients                                42780 non-null  float64
 18  icu_patients_per_million                    42780 non-null  float64
 19  hosp_patients                               28610 non-null  float64
 20  hosp_patients_per_million                   28610 non-null  float64
 21  weekly_icu_admissions                       13770 non-null  float64
 22  weekly_icu_admissions_per_million           13770 non-null  float64
 23  weekly_hosp_admissions                      36030 non-null  float64
 24  weekly_hosp_admissions_per_million          36030 non-null  float64
 25  new_tests                                   57550 non-null  float64
 26  total_tests                                 60560 non-null  float64
 27  total_tests_per_thousand                    60560 non-null  float64
 28  new_tests_per_thousand                      57550 non-null  float64
 29  new_tests_smoothed                          71160 non-null  float64
 30  new_tests_smoothed_per_thousand             71160 non-null  float64
 31  positive_rate                               58120 non-null  float64
 32  tests_per_case                              58120 non-null  float64
 33  tests_units                                 71860 non-null  object 
 34  total_vaccinations                          39500 non-null  float64
 35  people_vaccinated                           39380 non-null  float64
 36  people_fully_vaccinated                     38600 non-null  float64
 37  total_boosters                              20700 non-null  float64
 38  new_vaccinations                            38580 non-null  float64
 39  new_vaccinations_smoothed                   41810 non-null  float64
 40  total_vaccinations_per_hundred              39500 non-null  float64
 41  people_vaccinated_per_hundred               39380 non-null  float64
 42  people_fully_vaccinated_per_hundred         38600 non-null  float64
 43  total_boosters_per_hundred                  20700 non-null  float64
 44  new_vaccinations_smoothed_per_million       41810 non-null  float64
 45  new_people_vaccinated_smoothed              41810 non-null  float64
 46  new_people_vaccinated_smoothed_per_hundred  41810 non-null  float64
 47  stringency_index                            75800 non-null  float64
 48  population                                  76630 non-null  float64
 49  population_density                          76630 non-null  float64
 50  median_age                                  76630 non-null  float64
 51  aged_65_older                               76630 non-null  float64
 52  aged_70_older                               76630 non-null  float64
 53  gdp_per_capita                              76630 non-null  float64
 54  extreme_poverty                             53480 non-null  float64
 55  cardiovasc_death_rate                       76630 non-null  float64
 56  diabetes_prevalence                         76630 non-null  float64
 57  female_smokers                              76630 non-null  float64
 58  male_smokers                                76630 non-null  float64
 59  handwashing_facilities                      7660 non-null   float64
 60  hospital_beds_per_thousand                  76630 non-null  float64
 61  life_expectancy                             76630 non-null  float64
 62  human_development_index                     76630 non-null  float64
 63  excess_mortality_cumulative_absolute        6920 non-null   float64
 64  excess_mortality_cumulative                 6920 non-null   float64
 65  excess_mortality                            6920 non-null   float64
 66  excess_mortality_cumulative_per_million     6920 non-null   float64
 67  ano_mes                                     76630 non-null  object 
dtypes: float64(62), object(6)
memory usage: 61.2 MB

The DataFrame df_covid_duplicada has \(76,630\) rows and 68 columns, and is using 61.2 MB of our RAM. The first step toward making the object more efficient is to look for rows duplicated across all columns, which bring no new information but needlessly burden our memory. For that we will use the drop_duplicates function.

df_covid_duplicada.drop_duplicates(inplace=True)

df_covid_duplicada.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 7663 entries, 0 to 7662
Data columns (total 68 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    7663 non-null   object 
 1   continent                                   7663 non-null   object 
 2   location                                    7663 non-null   object 
 3   date                                        7663 non-null   object 
 4   total_cases                                 7663 non-null   float64
 5   new_cases                                   7641 non-null   float64
 6   new_cases_smoothed                          7484 non-null   float64
 7   total_deaths                                7345 non-null   float64
 8   new_deaths                                  7327 non-null   float64
 9   new_deaths_smoothed                         7173 non-null   float64
 10  total_cases_per_million                     7663 non-null   float64
 11  new_cases_per_million                       7641 non-null   float64
 12  new_cases_smoothed_per_million              7484 non-null   float64
 13  total_deaths_per_million                    7345 non-null   float64
 14  new_deaths_per_million                      7327 non-null   float64
 15  new_deaths_smoothed_per_million             7173 non-null   float64
 16  reproduction_rate                           7292 non-null   float64
 17  icu_patients                                4278 non-null   float64
 18  icu_patients_per_million                    4278 non-null   float64
 19  hosp_patients                               2861 non-null   float64
 20  hosp_patients_per_million                   2861 non-null   float64
 21  weekly_icu_admissions                       1377 non-null   float64
 22  weekly_icu_admissions_per_million           1377 non-null   float64
 23  weekly_hosp_admissions                      3603 non-null   float64
 24  weekly_hosp_admissions_per_million          3603 non-null   float64
 25  new_tests                                   5755 non-null   float64
 26  total_tests                                 6056 non-null   float64
 27  total_tests_per_thousand                    6056 non-null   float64
 28  new_tests_per_thousand                      5755 non-null   float64
 29  new_tests_smoothed                          7116 non-null   float64
 30  new_tests_smoothed_per_thousand             7116 non-null   float64
 31  positive_rate                               5812 non-null   float64
 32  tests_per_case                              5812 non-null   float64
 33  tests_units                                 7186 non-null   object 
 34  total_vaccinations                          3950 non-null   float64
 35  people_vaccinated                           3938 non-null   float64
 36  people_fully_vaccinated                     3860 non-null   float64
 37  total_boosters                              2070 non-null   float64
 38  new_vaccinations                            3858 non-null   float64
 39  new_vaccinations_smoothed                   4181 non-null   float64
 40  total_vaccinations_per_hundred              3950 non-null   float64
 41  people_vaccinated_per_hundred               3938 non-null   float64
 42  people_fully_vaccinated_per_hundred         3860 non-null   float64
 43  total_boosters_per_hundred                  2070 non-null   float64
 44  new_vaccinations_smoothed_per_million       4181 non-null   float64
 45  new_people_vaccinated_smoothed              4181 non-null   float64
 46  new_people_vaccinated_smoothed_per_hundred  4181 non-null   float64
 47  stringency_index                            7580 non-null   float64
 48  population                                  7663 non-null   float64
 49  population_density                          7663 non-null   float64
 50  median_age                                  7663 non-null   float64
 51  aged_65_older                               7663 non-null   float64
 52  aged_70_older                               7663 non-null   float64
 53  gdp_per_capita                              7663 non-null   float64
 54  extreme_poverty                             5348 non-null   float64
 55  cardiovasc_death_rate                       7663 non-null   float64
 56  diabetes_prevalence                         7663 non-null   float64
 57  female_smokers                              7663 non-null   float64
 58  male_smokers                                7663 non-null   float64
 59  handwashing_facilities                      766 non-null    float64
 60  hospital_beds_per_thousand                  7663 non-null   float64
 61  life_expectancy                             7663 non-null   float64
 62  human_development_index                     7663 non-null   float64
 63  excess_mortality_cumulative_absolute        692 non-null    float64
 64  excess_mortality_cumulative                 692 non-null    float64
 65  excess_mortality                            692 non-null    float64
 66  excess_mortality_cumulative_per_million     692 non-null    float64
 67  ano_mes                                     7663 non-null   object 
dtypes: float64(62), object(6)
memory usage: 6.2 MB

Great: we dropped the duplicated rows and reduced the memory footprint from 61.2 MB to 6.2 MB.

However, the data types of the columns do not look like the most appropriate ones. A float64 column consumes 4x more memory than a float16 column and 8x more memory than an int8 column, for example. Likewise, an object column consumes much more memory than a column Pandas understands as categorical, in most cases. But what are all these types?

  • int8 / uint8: consumes 1 byte (8 bits) of memory; the value must be an integer in the interval \([-128, 127]\) / \([0, 255]\)
  • int16 / uint16: consumes 2 bytes of memory; the value must be an integer in the interval \([-32768, 32767]\) / \([0, 65535]\)
  • int32 / uint32: consumes 4 bytes of memory and accepts integers with up to 10 digits, on the order of 2 billion.
  • int64 / uint64: consumes 8 bytes of memory and accepts even larger integers.
  • float16 / float32 / float64: consume 2, 4, and 8 bytes of memory, respectively, and accept values with decimal places, with ranges and precision that grow with the number of bytes.
  • bool: consumes 1 byte of memory and accepts only two values, True and False.
  • object: memory consumption varies, but, since it accepts numeric values and strings of any kind, it consumes at least 8 bytes of memory per entry.
  • category: memory consumption varies and depends on the ratio between the number of unique values in the column and the number of rows of data.
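We can see these differences directly by storing the same values under different dtypes and measuring them with memory_usage; the sketch below uses small made-up series for illustration.

```python
import pandas as pd

# The same 1000 integers stored under different dtypes (illustrative only)
valores = list(range(1000))

s_int64 = pd.Series(valores, dtype='int64')
s_int16 = pd.Series(valores, dtype='int16')
s_object = pd.Series([str(v) for v in valores], dtype='object')
s_cat = pd.Series(['A', 'B'] * 500, dtype='category')

# memory_usage(deep=True) counts the actual bytes, including string contents;
# index=False leaves the index out so we compare only the data
for s in [s_int64, s_int16, s_object, s_cat]:
    print(s.dtype, s.memory_usage(index=False, deep=True))
```

The int64 series takes exactly 8000 bytes (8 per value) and the int16 series 2000, while the object series of strings takes far more and the two-category series far less.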

All right: we should avoid the object type and try to downcast numeric columns to types that demand less memory. How do we do that? The astype method is our companion here.

Going back to our original dataframe, the columns iso_code, continent, and location can be turned into categorical columns with no loss of information. Let's start with them.

columns_to_cat = ['iso_code','continent','location']

for c in columns_to_cat:
    df_covid_duplicada[c] = df_covid_duplicada[c].astype('category')

How much did that help us?

df_covid_duplicada.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 7663 entries, 0 to 7662
Data columns (total 68 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   iso_code                                    7663 non-null   category
 1   continent                                   7663 non-null   category
 2   location                                    7663 non-null   category
 3   date                                        7663 non-null   object  
 4   total_cases                                 7663 non-null   float64 
 5   new_cases                                   7641 non-null   float64 
 6   new_cases_smoothed                          7484 non-null   float64 
 7   total_deaths                                7345 non-null   float64 
 8   new_deaths                                  7327 non-null   float64 
 9   new_deaths_smoothed                         7173 non-null   float64 
 10  total_cases_per_million                     7663 non-null   float64 
 11  new_cases_per_million                       7641 non-null   float64 
 12  new_cases_smoothed_per_million              7484 non-null   float64 
 13  total_deaths_per_million                    7345 non-null   float64 
 14  new_deaths_per_million                      7327 non-null   float64 
 15  new_deaths_smoothed_per_million             7173 non-null   float64 
 16  reproduction_rate                           7292 non-null   float64 
 17  icu_patients                                4278 non-null   float64 
 18  icu_patients_per_million                    4278 non-null   float64 
 19  hosp_patients                               2861 non-null   float64 
 20  hosp_patients_per_million                   2861 non-null   float64 
 21  weekly_icu_admissions                       1377 non-null   float64 
 22  weekly_icu_admissions_per_million           1377 non-null   float64 
 23  weekly_hosp_admissions                      3603 non-null   float64 
 24  weekly_hosp_admissions_per_million          3603 non-null   float64 
 25  new_tests                                   5755 non-null   float64 
 26  total_tests                                 6056 non-null   float64 
 27  total_tests_per_thousand                    6056 non-null   float64 
 28  new_tests_per_thousand                      5755 non-null   float64 
 29  new_tests_smoothed                          7116 non-null   float64 
 30  new_tests_smoothed_per_thousand             7116 non-null   float64 
 31  positive_rate                               5812 non-null   float64 
 32  tests_per_case                              5812 non-null   float64 
 33  tests_units                                 7186 non-null   object  
 34  total_vaccinations                          3950 non-null   float64 
 35  people_vaccinated                           3938 non-null   float64 
 36  people_fully_vaccinated                     3860 non-null   float64 
 37  total_boosters                              2070 non-null   float64 
 38  new_vaccinations                            3858 non-null   float64 
 39  new_vaccinations_smoothed                   4181 non-null   float64 
 40  total_vaccinations_per_hundred              3950 non-null   float64 
 41  people_vaccinated_per_hundred               3938 non-null   float64 
 42  people_fully_vaccinated_per_hundred         3860 non-null   float64 
 43  total_boosters_per_hundred                  2070 non-null   float64 
 44  new_vaccinations_smoothed_per_million       4181 non-null   float64 
 45  new_people_vaccinated_smoothed              4181 non-null   float64 
 46  new_people_vaccinated_smoothed_per_hundred  4181 non-null   float64 
 47  stringency_index                            7580 non-null   float64 
 48  population                                  7663 non-null   float64 
 49  population_density                          7663 non-null   float64 
 50  median_age                                  7663 non-null   float64 
 51  aged_65_older                               7663 non-null   float64 
 52  aged_70_older                               7663 non-null   float64 
 53  gdp_per_capita                              7663 non-null   float64 
 54  extreme_poverty                             5348 non-null   float64 
 55  cardiovasc_death_rate                       7663 non-null   float64 
 56  diabetes_prevalence                         7663 non-null   float64 
 57  female_smokers                              7663 non-null   float64 
 58  male_smokers                                7663 non-null   float64 
 59  handwashing_facilities                      766 non-null    float64 
 60  hospital_beds_per_thousand                  7663 non-null   float64 
 61  life_expectancy                             7663 non-null   float64 
 62  human_development_index                     7663 non-null   float64 
 63  excess_mortality_cumulative_absolute        692 non-null    float64 
 64  excess_mortality_cumulative                 692 non-null    float64 
 65  excess_mortality                            692 non-null    float64 
 66  excess_mortality_cumulative_per_million     692 non-null    float64 
 67  ano_mes                                     7663 non-null   object  
dtypes: category(3), float64(62), object(3)
memory usage: 5.0 MB

Wow! This small change reduced the memory footprint by approximately \(20\%\).

Let's try to convert the total_cases, new_cases, total_deaths, and new_deaths columns to int32 instead of float64, since we know they hold integer counts. Don't forget to handle missing values first, because int dtypes accept neither NaN nor inf as values.

columns_to_int = ['total_cases','new_cases','total_deaths','new_deaths']

for c in columns_to_int:
    df_covid_duplicada[c] = df_covid_duplicada[c].fillna(-1)
    df_covid_duplicada[c] = df_covid_duplicada[c].astype('int32')
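As an aside: instead of filling missing values with a sentinel such as -1, pandas also offers nullable integer dtypes (spelled with a capital letter, e.g. 'Int32') that can hold missing entries as pd.NA. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# A float column with a missing value, like new_cases (hypothetical values)
s = pd.Series([1.0, np.nan, 95747.0])

# astype('int32') would raise because of the NaN; the nullable 'Int32'
# dtype (capital I) keeps missing entries as pd.NA while storing the
# valid values as 32-bit integers
s_nullable = s.astype('Int32')

print(s_nullable.dtype)
print(s_nullable.isna().sum())
```

The trade-off is that nullable dtypes carry a small extra mask in memory and behave slightly differently from plain NumPy dtypes in some operations.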

How much did that help us?

df_covid_duplicada.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 7663 entries, 0 to 7662
Data columns (total 68 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   iso_code                                    7663 non-null   category
 1   continent                                   7663 non-null   category
 2   location                                    7663 non-null   category
 3   date                                        7663 non-null   object  
 4   total_cases                                 7663 non-null   int32   
 5   new_cases                                   7663 non-null   int32   
 6   new_cases_smoothed                          7484 non-null   float64 
 7   total_deaths                                7663 non-null   int32   
 8   new_deaths                                  7663 non-null   int32   
 9   new_deaths_smoothed                         7173 non-null   float64 
 10  total_cases_per_million                     7663 non-null   float64 
 11  new_cases_per_million                       7641 non-null   float64 
 12  new_cases_smoothed_per_million              7484 non-null   float64 
 13  total_deaths_per_million                    7345 non-null   float64 
 14  new_deaths_per_million                      7327 non-null   float64 
 15  new_deaths_smoothed_per_million             7173 non-null   float64 
 16  reproduction_rate                           7292 non-null   float64 
 17  icu_patients                                4278 non-null   float64 
 18  icu_patients_per_million                    4278 non-null   float64 
 19  hosp_patients                               2861 non-null   float64 
 20  hosp_patients_per_million                   2861 non-null   float64 
 21  weekly_icu_admissions                       1377 non-null   float64 
 22  weekly_icu_admissions_per_million           1377 non-null   float64 
 23  weekly_hosp_admissions                      3603 non-null   float64 
 24  weekly_hosp_admissions_per_million          3603 non-null   float64 
 25  new_tests                                   5755 non-null   float64 
 26  total_tests                                 6056 non-null   float64 
 27  total_tests_per_thousand                    6056 non-null   float64 
 28  new_tests_per_thousand                      5755 non-null   float64 
 29  new_tests_smoothed                          7116 non-null   float64 
 30  new_tests_smoothed_per_thousand             7116 non-null   float64 
 31  positive_rate                               5812 non-null   float64 
 32  tests_per_case                              5812 non-null   float64 
 33  tests_units                                 7186 non-null   object  
 34  total_vaccinations                          3950 non-null   float64 
 35  people_vaccinated                           3938 non-null   float64 
 36  people_fully_vaccinated                     3860 non-null   float64 
 37  total_boosters                              2070 non-null   float64 
 38  new_vaccinations                            3858 non-null   float64 
 39  new_vaccinations_smoothed                   4181 non-null   float64 
 40  total_vaccinations_per_hundred              3950 non-null   float64 
 41  people_vaccinated_per_hundred               3938 non-null   float64 
 42  people_fully_vaccinated_per_hundred         3860 non-null   float64 
 43  total_boosters_per_hundred                  2070 non-null   float64 
 44  new_vaccinations_smoothed_per_million       4181 non-null   float64 
 45  new_people_vaccinated_smoothed              4181 non-null   float64 
 46  new_people_vaccinated_smoothed_per_hundred  4181 non-null   float64 
 47  stringency_index                            7580 non-null   float64 
 48  population                                  7663 non-null   float64 
 49  population_density                          7663 non-null   float64 
 50  median_age                                  7663 non-null   float64 
 51  aged_65_older                               7663 non-null   float64 
 52  aged_70_older                               7663 non-null   float64 
 53  gdp_per_capita                              7663 non-null   float64 
 54  extreme_poverty                             5348 non-null   float64 
 55  cardiovasc_death_rate                       7663 non-null   float64 
 56  diabetes_prevalence                         7663 non-null   float64 
 57  female_smokers                              7663 non-null   float64 
 58  male_smokers                                7663 non-null   float64 
 59  handwashing_facilities                      766 non-null    float64 
 60  hospital_beds_per_thousand                  7663 non-null   float64 
 61  life_expectancy                             7663 non-null   float64 
 62  human_development_index                     7663 non-null   float64 
 63  excess_mortality_cumulative_absolute        692 non-null    float64 
 64  excess_mortality_cumulative                 692 non-null    float64 
 65  excess_mortality                            692 non-null    float64 
 66  excess_mortality_cumulative_per_million     692 non-null    float64 
 67  ano_mes                                     7663 non-null   object  
dtypes: category(3), float64(58), int32(4), object(3)
memory usage: 4.9 MB

We shaved off a bit more memory, although the difference was not as large this time. In dataframes with millions of rows and thousands of columns, however, this reduction can make all the difference. Tip: take a look at the pandas.to_numeric function and its downcast argument if you are unsure which numeric type a column should be converted to.
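A minimal sketch of what downcast does, using a small hypothetical column of case counts (the column name is illustrative): pandas.to_numeric picks the smallest numeric type that still fits every value.

```python
import pandas as pd

# Hypothetical column of case counts, stored as a 64-bit integer by default
df = pd.DataFrame({"new_cases": [90593, 92170, 92962, 92366]})
print(df["new_cases"].dtype)  # int64 on most 64-bit platforms

# downcast="integer" chooses the smallest integer type (int8/int16/int32/...)
# that can hold all the values; these fit in int32 but not int16
df["new_cases"] = pd.to_numeric(df["new_cases"], downcast="integer")
print(df["new_cases"].dtype)  # int32
```

For floating-point columns, downcast="float" works the same way, typically reducing float64 to float32.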

In the end, an important part of the memory-optimization strategy for dataframes comes from knowing the most appropriate data type for each column before reading the dataframe. This can be done by first reading a sample of just a few rows (e.g., 10 or 20) of the dataframe we will work with, in order to understand the dataset, and only then reading the full dataframe with read_csv (or a similar function), passing a mapping of column names to data types via the dtype argument. Try it at home! And don't forget: stay away from object-typed data!
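The two-step workflow described above can be sketched as follows. Here an in-memory CSV (with made-up column names and values) stands in for the full file on disk; with a real file you would pass its path to read_csv instead:

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical data)
csv_text = (
    "location,date,new_cases\n"
    "Brazil,2021-01-01,56648\n"
    "Brazil,2021-01-02,33040\n"
)

# Step 1: read only a few rows to inspect columns and inferred types
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample.dtypes)

# Step 2: re-read the full file, declaring the right type for each column
dtypes = {"location": "category", "new_cases": "int32"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, parse_dates=["date"])
print(df.dtypes)
```

Declaring the types at read time avoids ever materializing the expensive object/int64 defaults, which is cheaper than converting the columns afterwards.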

8.6 Exercises

  1. X

  2. X