Clean Data Module
Clean Data.
Script that cleans data from a dataset and writes the cleaned data to a new Excel sheet.
Can be run from the command line:
python3 clean_data.py {input_file} {output_file}
Or import the clean_data function:
from clean_data import clean_data
This module contains the following functions:
- create_data_frame
- clean_data
- remove_duplicates
- main
clean_data(data_frame)
Clean data.
Takes a data_frame object and cleans it by using 'applymap' to apply a lambda function to every DataFrame value.
The first lambda function converts each value to a string, replaces instances of ND with an empty string, and strips whitespace and quotation marks.
The second lambda function converts values that should be numeric back to numeric values.
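The two passes described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the names `clean_data_sketch`, `_elementwise`, `_to_clean_text`, and `_to_number` are hypothetical, and the sketch assumes that ND is matched against the whole cell rather than as a substring. Because pandas 2.1 deprecated `DataFrame.applymap` in favour of `DataFrame.map`, the helper uses whichever is available.

```python
import pandas as pd


def _elementwise(data_frame, func):
    """Apply func to every cell (hypothetical helper).

    pandas 2.1 deprecated DataFrame.applymap in favour of DataFrame.map,
    so use the newer name when the installed pandas provides it.
    """
    if hasattr(pd.DataFrame, "map"):
        return data_frame.map(func)
    return data_frame.applymap(func)


def _to_clean_text(value):
    """Stringify a value, strip whitespace and quotation marks, blank out ND."""
    text = str(value).strip().strip("\"'").strip()
    # Assumption: only a cell that is exactly "ND" is replaced.
    return "" if text == "ND" else text


def _to_number(value):
    """Convert numeric-looking text back to int or float; leave the rest."""
    for convert in (int, float):
        try:
            return convert(value)
        except (ValueError, TypeError):
            continue
    return value


def clean_data_sketch(data_frame):
    # Pass 1: every value becomes cleaned text.
    as_text = _elementwise(data_frame, lambda value: _to_clean_text(value))
    # Pass 2: values that should be numeric become numbers again.
    return _elementwise(as_text, lambda value: _to_number(value))
```

For example, a cell containing `' 42 '` would come back as the integer 42, while a cell containing `'ND'` would come back as an empty string.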
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_frame | pandas.DataFrame | Pandas DataFrame object | required |
Returns:
| Type | Description |
|---|---|
| pandas.DataFrame | clean_data_frame (pandas.DataFrame): Pandas DataFrame object |
Source code in report_generator/excel_extraction/clean_data.py
create_data_frame(path_to_dataset)
Create data frame.
Loads an Excel file and returns a Pandas DataFrame object.
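A loader of this kind could look roughly like the sketch below. It assumes the file is read with `pandas.read_excel`, which the page does not confirm. The function name `create_data_frame_sketch` and the explicit existence check are illustrative additions.

```python
from pathlib import Path

import pandas as pd


def create_data_frame_sketch(path_to_dataset):
    """Load an Excel file into a DataFrame (illustrative sketch only)."""
    path = Path(path_to_dataset)
    # Assumption: fail early with a clear message when the file is missing.
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    # read_excel needs an Excel engine installed (for example openpyxl
    # for .xlsx files); by default it reads the first sheet.
    return pd.read_excel(path)
```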
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_to_dataset | str | File path string | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data_frame | pandas.DataFrame | Pandas DataFrame loaded from the Excel file |
Source code in report_generator/excel_extraction/clean_data.py
main(input_file_name, output_file_name)
Clean data main.
Takes input_file_name and output_file_name, loads the data from the input file, cleans it, and saves the result as the output file.
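The load, clean, save flow and the two command-line arguments might be wired together as in the sketch below. It is not the module's source: `parse_args` and `main_sketch` are hypothetical names, `argparse` is an assumed choice for reading the arguments, and the cleaning step is a placeholder standing in for the module's own cleaning functions.

```python
import argparse

import pandas as pd


def parse_args(argv=None):
    """Read the two positional arguments from the command line."""
    parser = argparse.ArgumentParser(
        description="Clean a dataset and save it as a new Excel sheet."
    )
    parser.add_argument("input_file_name", help="Excel file to read")
    parser.add_argument("output_file_name", help="Excel file to write")
    return parser.parse_args(argv)


def main_sketch(input_file_name, output_file_name):
    """Load, clean, and save (illustrative sketch only)."""
    # Load: assumes the dataset is an Excel file readable by pandas.
    data_frame = pd.read_excel(input_file_name)
    # Clean: placeholder step that drops fully empty rows; the real
    # module would call its own cleaning functions here.
    cleaned = data_frame.dropna(how="all")
    # Save: write the result without the index column.
    cleaned.to_excel(output_file_name, index=False)
```

A script entry point could then call `main_sketch(**vars(parse_args()))` to pass the parsed arguments straight through.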
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_file_name | str | The input file name string | required |
| output_file_name | str | The output file name string | required |
Source code in report_generator/excel_extraction/clean_data.py
remove_duplicates(data_frame)
Remove duplicates from data.
Takes a data_frame object, looks for entries with a duplicate name combination, removes those rows from the dataset, and writes them to a separate file.
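One way to express this with pandas is `DataFrame.duplicated` restricted to the name columns, as in the sketch below. The column names `first_name` and `last_name`, the CSV output, and the choice to keep the first occurrence of each name combination are all assumptions made for illustration. The page does not say which columns form the name combination or what format the separate file uses.

```python
import pandas as pd


def remove_duplicates_sketch(
    data_frame,
    name_columns=("first_name", "last_name"),  # hypothetical column names
    duplicates_path=None,                      # hypothetical output path
):
    """Split out rows whose name combination was already seen."""
    # True for every repeat of a name combination after its first occurrence.
    is_duplicate = data_frame.duplicated(subset=list(name_columns), keep="first")
    duplicates = data_frame[is_duplicate]
    # Assumption: removed rows go to a CSV file when a path is given.
    if duplicates_path is not None and not duplicates.empty:
        duplicates.to_csv(duplicates_path, index=False)
    # Return the remaining rows with a fresh index.
    return data_frame[~is_duplicate].reset_index(drop=True)
```

Passing `keep=False` to `duplicated` instead would flag every row of a repeated combination, including the first, if that is the behaviour wanted.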
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_frame | pandas.DataFrame | Pandas DataFrame object | required |
Returns:
| Name | Type | Description |
|---|---|---|
| clear_data_frame | pandas.DataFrame | DataFrame with the duplicate rows removed |
Source code in report_generator/excel_extraction/clean_data.py