Unraveling Crime Patterns in Toronto: A Data-Driven Exploration

Jan 10, 2024

In this comprehensive data analysis and machine learning project, we explore a dataset titled “Police Annual Statistical Report — Arrested and Charged Persons.” This dataset contains valuable information about individuals who have been arrested and charged, including details about their demographics, the type of crime committed, and the geographical location of the incidents. Our primary goal is to gain insights from the data and build a machine learning model to predict the type of crime based on age, sex, and neighborhood.

Dataset Overview:

The dataset consists of 129,374 entries and 11 columns, with each row representing an individual who has been arrested and charged. The columns include ARREST_YEAR, ARREST_COUNT, DIVISION, HOOD_158, NEIGHBOURHOOD_158, SEX, AGE_COHORT, AGE_GROUP, CATEGORY, and SUBTYPE. The dataset has been preprocessed and cleaned to ensure its quality and reliability; the steps are described below.

Data Cleaning and Preprocessing:

Checking for Missing Values: We initially checked for missing values in the dataset and found that there were no missing values in any of its columns.
Examining Unique Values: To identify inconsistencies or irregularities, we analyzed the unique values in categorical columns like DIVISION, HOOD_158, NEIGHBOURHOOD_158, SEX, AGE_COHORT, AGE_GROUP, CATEGORY, and SUBTYPE. This step ensures that similar values are represented consistently.
Handling ‘NSA’ and ‘Unknown’ Entries: ‘NSA’ entries in DIVISION and HOOD_158 were retained for a more comprehensive analysis. ‘Unknown’ entries in AGE_COHORT and AGE_GROUP were removed, as their impact on the dataset was minimal.
Outliers and Data Consistency: We examined the ARREST_COUNT column and found significant outliers. We have not yet decided how to handle them.
Checking for Duplicate Entries: We checked for duplicate entries in the dataset and found that there were no duplicate records, ensuring data integrity.
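The checks above can be sketched with pandas. This is a minimal illustration rather than the project's actual code: the rows are made up (only the column names come from the dataset), and the 1.5 × IQR rule used to flag outliers is an assumption, since the article does not say how outliers were identified.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the dataset.
df = pd.DataFrame({
    "ARREST_YEAR": [2014, 2014, 2015, 2015, 2016],
    "DIVISION": ["D55", "NSA", "D43", "D32", "D55"],
    "SEX": ["Male", "Female", "Male", "Male", "Female"],
    "AGE_COHORT": ["25 to 34", "18 to 24", "Unknown", "35 to 44", "25 to 34"],
    "ARREST_COUNT": [3, 1, 2, 250, 4],
})

# 1. Missing values per column.
missing_per_column = df.isna().sum()

# 2. Unique values per categorical column, to spot inconsistent labels.
categorical = ["DIVISION", "SEX", "AGE_COHORT"]
unique_values = {col: sorted(df[col].unique()) for col in categorical}

# 3. Drop rows with an 'Unknown' age cohort; 'NSA' divisions are kept.
cleaned = df[df["AGE_COHORT"] != "Unknown"].reset_index(drop=True)

# 4. Flag (not remove) ARREST_COUNT outliers with the 1.5 * IQR rule.
#    The rule itself is an assumption, not the article's stated method.
q1, q3 = cleaned["ARREST_COUNT"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = cleaned[cleaned["ARREST_COUNT"] > upper_fence]

# 5. Count fully duplicated rows.
duplicate_count = int(cleaned.duplicated().sum())
```

Flagging outliers instead of dropping them matches the article's position that no handling decision has been made yet.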

Exploratory Data Analysis (EDA)

With a clean dataset in hand, we proceeded to the exploratory data analysis phase. In this phase, we aimed to gain insights into various aspects of the data, including trends over the years, differences in crime categories by age or sex, and the geographical distribution of arrests.

Trends Over Time: We analyzed the yearly distribution of arrests and observed a general decline in total arrests from 2014 to 2020, possibly reflecting changes in law enforcement practices and social dynamics. The decrease in 2020 in particular is likely related to the global COVID-19 pandemic.
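A yearly trend like this one can be computed by summing arrests per year. The sketch below assumes that ARREST_COUNT holds the number of arrests for each row, so totals are sums rather than row counts; the figures are invented for illustration.

```python
import pandas as pd

# Invented figures; only the column names come from the dataset.
arrests = pd.DataFrame({
    "ARREST_YEAR": [2014, 2014, 2019, 2019, 2020],
    "ARREST_COUNT": [5, 4, 3, 3, 2],
})

# Total arrests per year, in chronological order.
yearly_totals = arrests.groupby("ARREST_YEAR")["ARREST_COUNT"].sum().sort_index()

# Percentage change from the previous year in the series (NaN for the first year).
yearly_change_pct = yearly_totals.pct_change().mul(100).round(1)
```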

Distribution of Crime Categories: We examined the distribution of different crime categories, with a focus on identifying the most common and less common categories. “Crimes Against the Person” and “Crimes Against Property” emerged as the most common categories, with “Other Criminal Code Violations” also being significant.

Crime Subtype Analysis: Within the categories “Crimes Against the Person” and “Crimes Against Property,” we delved deeper into the distribution of crime subtypes. Notable subtypes included “Assaults,” “Sexual Violations,” “Theft Under $5000,” and “Break and Enter.” This analysis provided a detailed breakdown of the types of offenses within these categories.
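One way to produce the category and subtype breakdowns is a grouped sum, followed by each subtype's share within its own category. The counts below are made up, and the "Other" subtype is a hypothetical placeholder, not a label taken from the dataset.

```python
import pandas as pd

# Made-up counts; "Other" is a hypothetical subtype used only for illustration.
charges = pd.DataFrame({
    "CATEGORY": [
        "Crimes Against the Person", "Crimes Against the Person",
        "Crimes Against Property", "Crimes Against Property",
        "Other Criminal Code Violations",
    ],
    "SUBTYPE": [
        "Assaults", "Sexual Violations",
        "Theft Under $5000", "Break and Enter",
        "Other",
    ],
    "ARREST_COUNT": [40, 10, 25, 15, 10],
})

# Total arrests per category, largest first.
by_category = (
    charges.groupby("CATEGORY")["ARREST_COUNT"].sum().sort_values(ascending=False)
)

# Share of each subtype within its own category.
category_totals = charges.groupby("CATEGORY")["ARREST_COUNT"].transform("sum")
charges["SHARE_IN_CATEGORY"] = charges["ARREST_COUNT"] / category_totals
```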

Demographic Distribution Analysis: We explored the distribution of arrests based on demographics, focusing on two key attributes: sex and age cohort.

Distribution by Sex: We analyzed the number of arrests for each sex category and found that males constituted the majority of arrests, followed by females. A small number of records had ‘Unknown’ sex.

Distribution by Age Cohort: We examined how arrests were distributed across different age cohorts. The “25 to 34” age group had the highest number of arrests, followed by “18 to 24” and “35 to 44.” This analysis shed light on the age groups most commonly associated with arrests.
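The two demographic views can also be combined into a single sex-by-age table. The following sketch uses invented counts and again assumes that ARREST_COUNT should be summed.

```python
import pandas as pd

# Invented counts; the labels mirror those mentioned in the article.
people = pd.DataFrame({
    "SEX": ["Male", "Male", "Female", "Male", "Female", "Unknown"],
    "AGE_COHORT": ["25 to 34", "18 to 24", "25 to 34", "35 to 44", "18 to 24", "25 to 34"],
    "ARREST_COUNT": [6, 4, 3, 3, 2, 1],
})

# Arrests per sex, largest first.
by_sex = people.groupby("SEX")["ARREST_COUNT"].sum().sort_values(ascending=False)

# Age cohort (rows) by sex (columns), filled with summed arrests.
sex_by_age = pd.crosstab(
    index=people["AGE_COHORT"],
    columns=people["SEX"],
    values=people["ARREST_COUNT"],
    aggfunc="sum",
).fillna(0)

# Each cohort's share of all arrests.
cohort_share = sex_by_age.sum(axis=1) / sex_by_age.to_numpy().sum()
```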

Geographical Distribution of Arrests: For the geographical analysis, we looked at the distribution of arrests across different neighborhoods (NEIGHBOURHOOD_158) and police divisions (DIVISION). We identified areas with the highest and lowest arrest counts.

Distribution Across Police Divisions: We analyzed the distribution of arrests across various police divisions, highlighting divisions with the highest arrest counts. Division D55 had the highest number of arrests, followed by D43 and D32.

Distribution Across Neighborhoods: We identified the neighborhoods with the highest arrest counts. Kensington-Chinatown, York University Heights, and Downtown Yonge East were among them.
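Ranking divisions and neighborhoods comes down to the same grouped sum applied to a different column. In this sketch, `arrest_totals` is a hypothetical helper, and the counts and the pairing of divisions with neighborhoods are arbitrary, not taken from the data.

```python
import pandas as pd

# Arbitrary counts and division/neighbourhood pairings, for illustration only.
geo = pd.DataFrame({
    "DIVISION": ["D55", "D55", "D43", "D32", "NSA"],
    "NEIGHBOURHOOD_158": [
        "Kensington-Chinatown", "Downtown Yonge East",
        "York University Heights", "Kensington-Chinatown", "NSA",
    ],
    "ARREST_COUNT": [9, 5, 7, 4, 1],
})


def arrest_totals(frame: pd.DataFrame, column: str) -> pd.Series:
    """Hypothetical helper: summed ARREST_COUNT per value of `column`, largest first."""
    return frame.groupby(column)["ARREST_COUNT"].sum().sort_values(ascending=False)


top_divisions = arrest_totals(geo, "DIVISION").head(3)
top_neighbourhoods = arrest_totals(geo, "NEIGHBOURHOOD_158").head(3)
lowest_neighbourhoods = arrest_totals(geo, "NEIGHBOURHOOD_158").tail(1)
```

Because 'NSA' rows were retained during cleaning, they show up in these rankings as well; filtering them out first is a reasonable option when only named areas are of interest.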

Machine Learning for Crime Prediction

After thoroughly exploring the dataset, we transitioned to the machine learning phase, where our goal was to build a predictive model for crime categories based on age, sex, and neighborhood. This is a challenging task due to the complexity and variability of crime data.

Data Preparation: We prepared the data for machine learning by encoding categorical features like sex and neighborhood. We also considered feature scaling to ensure that all features contributed equally to the model.

Feature Selection: We selected relevant features for the model, which included age, sex, and encoded neighborhood information. We considered these features as potential predictors of crime category.

Model Selection: For our initial model, we chose the Random Forest Classifier due to its versatility and its ability to work with a mix of encoded categorical and numerical features. However, we recognized that other models could also be explored.

Model Training and Evaluation: We split the dataset into training and testing sets to train and evaluate the model’s performance. The model achieved an accuracy of approximately 26.43% on the test set. While this accuracy may seem low, it’s important to consider the challenges associated with predicting crime categories, including imbalanced classes, limitations in feature representation, and model complexity.
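The training and evaluation step might look like the sketch below. The data is synthetic and random, so the resulting scores say nothing about the real dataset. The 80/20 split, the stratification, and the hyperparameters are assumptions, and `make_model` is a hypothetical helper. A majority-class baseline is included because, with imbalanced classes, raw accuracy is easier to interpret next to the score obtained by always predicting the most common category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, randomly generated rows; the labels mirror those in the article.
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    "AGE_COHORT": rng.choice(["18 to 24", "25 to 34", "35 to 44"], size=n),
    "SEX": rng.choice(["Male", "Female"], size=n),
    "NEIGHBOURHOOD_158": rng.choice(
        ["Kensington-Chinatown", "York University Heights", "Downtown Yonge East"],
        size=n,
    ),
    "CATEGORY": rng.choice(
        ["Crimes Against the Person", "Crimes Against Property",
         "Other Criminal Code Violations"],
        size=n,
        p=[0.5, 0.3, 0.2],
    ),
})

X = data[["AGE_COHORT", "SEX", "NEIGHBOURHOOD_158"]]
y = data["CATEGORY"]

# Assumed 80/20 split, stratified so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def make_model(classifier):
    """Hypothetical helper: one-hot encode all features, then apply the classifier."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), list(X.columns))]
    )
    return make_pipeline(encode, classifier)


forest = make_model(RandomForestClassifier(n_estimators=100, random_state=42))
baseline = make_model(DummyClassifier(strategy="most_frequent"))
forest.fit(X_train, y_train)
baseline.fit(X_train, y_train)

forest_accuracy = accuracy_score(y_test, forest.predict(X_test))
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
```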

Challenges and Future Work

Throughout this project, we encountered several challenges, including handling outliers, addressing imbalanced classes, and exploring more sophisticated models. Future work could include:

Outlier Handling: Further investigation and consideration of outlier removal or transformation techniques to improve model performance.
Imbalanced Classes: Implementing techniques like oversampling or using different evaluation metrics to address class imbalance and improve prediction accuracy.
Model Complexity: Experimenting with more advanced machine learning models such as Gradient Boosting or neural networks to achieve better results.
Feature Engineering: Exploring additional feature engineering techniques to enhance the predictive power of the model.
Hyperparameter Tuning: Fine-tuning model hyperparameters to optimize performance.
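Two of the items above, class imbalance and hyperparameter tuning, can be explored together. The sketch below is one possible setup and not something the project has run. It searches over `class_weight` and `max_depth` with `GridSearchCV` and scores with balanced accuracy, which averages recall across classes so that rare categories count as much as common ones. The features, the labels "A", "B" and "C", and the grid values are all placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data: 3 integer-coded features and an imbalanced 3-class target.
rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(120, 3))
y = np.array(["A"] * 72 + ["B"] * 36 + ["C"] * 12)

# Placeholder grid: reweight classes or not, shallow or unrestricted trees.
param_grid = {
    "class_weight": [None, "balanced"],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # mean per-class recall
    cv=3,  # stratified 3-fold cross-validation for classifiers
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```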

Conclusion

In this extensive data analysis and machine learning project, we explored a dataset containing valuable information about arrests and charges in Toronto, Canada. We conducted data cleaning, exploratory data analysis, and built a machine learning model to predict crime categories based on age, sex, and neighborhood. While the initial model’s accuracy was modest, it serves as a starting point for further research and improvement.

By delving into the complexities of crime data and leveraging advanced machine learning techniques, we aim to contribute to the understanding of crime patterns and potentially assist law enforcement agencies in allocating resources effectively. This project highlights the multifaceted nature of data science and the challenges and opportunities it presents in addressing real-world issues.

As we continue to refine our model and explore additional avenues for enhancement, we remain committed to the pursuit of innovative solutions in the field of data science and artificial intelligence.

Links:

Please visit the GitHub link for more details. Thank you.

Data link

GitHub link