Skip to content

Clean data

Clean Data.

Script to clean data from dataset and create new excel sheet with data.

Can be ran from command line

python3 clean_data.py {input_file} {output_file}

Or import clean_data function

from clean_data import clean_data

clean_data(data_frame)

Clean data.

Takes data_frame object and cleans data by using 'applymap' to apply a lamda function to all DataFrame values

1st Lambda function converts values to strings, replaces instances of ND with empty string, and strips whitespace and quotation marks.

2nd Lambda function converts values that should be numeric back to numeric values.

Parameters:

Name Type Description Default
data_frame(object)

Pandas DataFrame object

required

Returns:

Type Description
object

clean_data_frame(object) Pandas DataFrame object

Source code in report_generator/location_formatter/clean_data.py
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
def clean_data(data_frame: object) -> object:
    """Clean data.

    Takes data_frame object and cleans data by using 'applymap' to apply a
    lamda function to all DataFrame values

    1st Lambda function converts values to strings, replaces instances of ND
    with empty string, and strips whitespace and quotation marks.

    2nd Lambda function converts values that should be numeric back to
    numeric values.

    Args:
        data_frame(object): Pandas DataFrame object

    Returns:
        clean_data_frame(object) Pandas DataFrame object

    """
    clean_data_frame = data_frame.applymap(
        lambda x: re.sub(r'^\s*["]*ND["]*\s*$', "", str(x)).strip().strip('"')
    )
    clean_data_frame = clean_data_frame.applymap(
        lambda x: pandas.to_numeric(x, errors="coerce")
        if x.replace(".", "").isdigit()
        else x
    )
    return clean_data_frame

create_data_frame(path_to_dataset)

Create data frame.

Loads excel file and returns Pandas DataFrame obj

Parameters:

Name Type Description Default
path_to_data_set(str)

file path string

required

Returns:

Name Type Description
data_frame object

Pandas DataFrame object

Source code in report_generator/location_formatter/clean_data.py
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def create_data_frame(path_to_dataset: str):
    """Create data frame.

    Loads excel file and returns Pandas DataFrame obj

    Args:
        path_to_data_set(str): file path string

    Returns:
        data_frame(object): Pandas DataFrame object
    """
    data_frame = None
    try:
        data_frame = pandas.read_excel(path_to_dataset)

    except FileNotFoundError as e:
        logger.error(f"Cleaning Data - Failed to open excel file: {e}")

    return data_frame

main(input_file_name, output_file_name)

Clean data main.

Takes input_file_name and output_file_name. Loads data from input_file Cleans data Saves as output_file

Parameters:

Name Type Description Default
input_file_name required
output_file_name required
Source code in report_generator/location_formatter/clean_data.py
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
def main(input_file_name, output_file_name) -> None:
    """Clean data main.

    Takes input_file_name and output_file_name.
    Loads data from input_file
    Cleans data
    Saves as output_file

    Args:
        input_file_name:
        output_file_name:

    """
    # Get filepaths
    cur_dir = os.getcwd()
    path_to_dataset = os.path.join(cur_dir, input_file_name)
    path_to_output_file = os.path.join(cur_dir, output_file_name)

    # load df
    try:
        data_frame = create_data_frame(path_to_dataset)
        # print(data_frame["RangeSize"].head(5))

        # clean df
        clean_data_frame = clean_data(data_frame)
        # print(clean_data_frame["RangeSize"].head())

        # output to new file
        clean_data_frame.to_excel(path_to_output_file, index=False)

    except FileNotFoundError as e:
        print(e)