Table of contents
Introduction
A data cleaning challenge was organized in the data tech space by Promise Chinonso. The FIFA data provided for the challenge was messy and required cleaning to make it usable for analysis. As a participant in this challenge, I utilized Microsoft Excel for the cleaning process, leveraging a data dictionary provided for the project to navigate through the dataset.
About the Data
The dataset obtained from Kaggle.com consisted of 188,979 rows and 77 columns. The columns included various player attributes such as Name, Long Name, photo URL, player URL, Age, OVA, POT, Team, Contract, ID, Height, Weight, foot, BOV, Growth, Joined, Loan Date End, Value, Wage, Release Clause, and several others related to player skills and characteristics.
Steps Taken in Cleaning the Data
Duplicating the Sheet: The first step in cleaning the data involved duplicating the original sheet. This was done to have a backup of the original data and to easily trace back if any mistakes were made during the cleaning process.
Handling Wrong Data Types: The dataset contained columns with incorrect data types, which were corrected to ensure consistency and accuracy in the analysis.
Addressing Inconsistencies with Units: Some columns had different measurements embedded in them, which were standardized to ensure uniformity across the dataset.
Handling Null Values: Null values in the dataset were addressed by either filling them with appropriate values or removing them, depending on the context and impact on the analysis.
Removing Unwanted Columns: Columns that were not relevant to the analysis or contained redundant information were removed from the dataset to streamline it for further analysis.
Dealing with Special Characters: Some columns had special characters, such as "A Circumflex (◌̂)" in names, which could affect sorting and analysis. These special characters were removed to ensure proper handling of the data.
Splitting Columns: The "Player URL" column, which contained player names without special characters, was split into several columns using "/" as a delimiter to extract the names into First Name and Last Name for easier analysis.
THE NAME COLUMN
The image below shows the extracted names from the Playerurl.
OVA and POT (Overall Analysis and Player Potential)
The data dictionary for the FIFA dataset specified that the columns for overall analysis (OVA) and player potential (POT) should be in percentage format. However, these columns were initially in the wrong data type format. To rectify this issue, I standardized the columns by changing their data type to percentage format. This ensured that the OVA and POT columns were accurately represented as percentages, as intended by the data dictionary.
Age Column
The age column in the FIFA dataset was found to be in the wrong data type format. To correct this, I converted the age column to the correct data type. Additionally, since the dataset is 2 years old, I updated the ages by adding "2 years" to make them current. This adjustment ensured that the age column accurately reflected the players' ages at the time of the analysis.
WRONG AGE
RIGHT AGE: Age After ( Data Type = Number, Age up to date)
Club Column
The club column in the FIFA dataset contained errors in the club names, including excessive spaces before the club names and some clubs prefixed with the number "1". To clean up these errors, I used the Trim and Clean functions in Excel to remove leading and trailing spaces and to eliminate any instances of the number "1" before the club names. This cleaning process ensured that the club names were correctly formatted and ready for analysis.
TABLE BEFORE:
TABLE CLEANED:
Height and Weight Columns
The Height and Weight columns in the FIFA dataset required significant cleaning due to the presence of mixed units.
For the Height column:
I converted the "feet and inches" measurements to centimeters (cm) for uniformity.
This involved converting feet to inches (1 foot = 30.48 cm) and then adding the inches to the total.
The final result was the height in centimeters for all players.
For the Weight column:
I converted the "pounds" measurements to kilograms (Kg) for uniformity.
This involved converting pounds to kilograms (1 pound = 0.453592 Kg).
The final result was the weight in kilograms for all players.
These conversions ensured that the Height and Weight columns were standardized and ready for further analysis.
DIRTY TABLE
DIRTY TABLE
The Contract column:
in the FIFA dataset displayed the start and end dates of player contracts for their respective clubs, separated by a "~" symbol. To facilitate easier analysis, I split this column into two separate columns: "Contract Start" and "Contract End".
Additionally, some players were either "on loan" or "free" after their contracts. To standardize the data, I converted all instances of "On loan" and "Free" to null values, ensuring consistency in the dataset for further analysis.
CONTRACT BEFORE
CONTRACT CLEANED
Values, Wages, Release Clause Columns
The Values, Wages, and Release Clause columns in the FIFA dataset contained different denominations, including "Millions (M)", "Thousands (K)", and "Tens". To standardize these values, I removed the Pound sign and then multiplied each value by their respective denomination.
For Values:
Values ending with "M" were multiplied by 1,000,000 to convert them to millions.
Values ending with "K" were multiplied by 1,000 to convert them to thousands.
The formula used for Values was: =IF(RIGHT(cell_number,1)="M",LEFT(cell_number,LEN(cell_number)-1)*1000000,IF(RIGHT(cell_number,1)="K",LEFT(cell_number,LEN(cell_number)-1)*1000,""))
This formula was adjusted for Wages and Release Clause columns to ensure consistent formatting and accurate representation of the data.
DIRTY VALUES
CLEANED VALUES
Value, Wage and Release clause after. After being multiplied their currency was then added
Nationality Column
The Nationality column in the FIFA dataset was already in the correct format and required no cleaning.
Joined Column
The Joined column, which indicates the date each player joined their respective clubs, was also in the correct format and needed no changes.
Joined date column
Weak Foot Rating, International reputation, Skill Moves
These columns contained ratings of 1-5. The values in this column had a "star" sign at the back of each. To clean these, the "star" sign which signified the rating was removed
BEFORE
AFTER
Conclusion
The FIFA dataset presented a significant cleaning challenge, requiring the application of various techniques to ensure its accuracy and usability for analysis. By addressing issues such as incorrect data types, mixed units, special characters, and inconsistent formatting, the dataset has been successfully cleaned and is now ready for analysis. This project has expanded my data cleaning skills and provided valuable experience in handling complex datasets.