Chapter 2 Data Source

2.1 DATA SOURCE DESCRIPTION

2.1.1 NYPD Complaint Data

Our main dataset, “NYPD Complaint Data”, was downloaded from NYC Open Data. It is collected by New York City agencies and other partners.

2.1.2 Covid Case by Date

The supplementary dataset, “Covid-Case-by-DATE”, was downloaded from https://github.com/nychealth/coronavirus-data/tree/master/trends#data-by-daycsv. This folder is updated daily with the daily, weekly, and monthly data published by the Health Department. The contributors to the GitHub repository are responsible for collecting the data.

2.2 DATA DESCRIPTION

The “NYPD Complaint Data” dataset includes all complaints reported in New York City, from the earliest available records up to the present. It has a total of 323,817 rows and 36 columns. To better serve the purpose of our project, we select only the columns of interest to us.

2.2.1 Variables

CMPLNT_NUM: Randomly generated persistent ID for each complaint

ADDR_PCT_CD: The precinct in which the incident occurred

BORO_NM: The name of the borough in which the incident occurred

CMPLNT_FR_DT: Exact date of occurrence for the reported event

CMPLNT_FR_TM: Exact time of occurrence for the reported event

LAW_CAT_CD: Level of offense: felony, misdemeanor, violation

LOC_OF_OCCUR_DESC: Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of

OFNS_DESC: Description of offense corresponding with key code

PREM_TYP_DESC: Specific description of premises

SUSP_AGE_GROUP, SUSP_RACE, SUSP_SEX: Age group, race, and sex of the suspect

VIC_AGE_GROUP, VIC_RACE, VIC_SEX: Age group, race, and sex of the victim

X_COORD_CD, Y_COORD_CD, Latitude, Longitude, Lat_Lon, New Georeferenced Column: Description of the geographic location
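
The column selection described in this section can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the code used in the project: the helper `select_columns`, the list `COLUMNS_OF_INTEREST`, and the two sample rows are hypothetical, and only a few of the variables listed above are included.

```python
import csv
import io

# A few of the variables listed above; extend this list as needed.
COLUMNS_OF_INTEREST = ["CMPLNT_NUM", "ADDR_PCT_CD", "BORO_NM", "LAW_CAT_CD", "OFNS_DESC"]


def select_columns(csv_file, columns):
    """Return each row as a dict restricted to `columns`.

    Raises KeyError if any requested column is absent from the header.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in file: {missing}")
    return [{c: row[c] for c in columns} for row in reader]


# Tiny made-up sample standing in for the real export (values are illustrative only).
SAMPLE = """CMPLNT_NUM,ADDR_PCT_CD,BORO_NM,LAW_CAT_CD,OFNS_DESC,EXTRA_COL
100000001,75,BROOKLYN,FELONY,ROBBERY,x
100000002,14,MANHATTAN,MISDEMEANOR,PETIT LARCENY,y
"""

rows = select_columns(io.StringIO(SAMPLE), COLUMNS_OF_INTEREST)
print(rows[0])
```

With the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle for the downloaded CSV. Failing loudly on a missing column is a deliberate choice here, since a misspelled column name would otherwise go unnoticed.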

2.3 DATA ISSUE

First, we found during the clean-up process that only limited data are available for the year 2020. When graphing, we applied some data transformations to reduce the potential bias caused by the limited data. We also found a substantial number of unknown or empty entries in the dataset, especially in the demographic descriptions of both suspects and victims.
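
One way to quantify the unknown or empty entries is to compute, for each column, the share of rows whose value is missing. The sketch below is an illustration in Python using only the standard library, not the procedure used in the project: the helper `missing_share`, the set `MISSING_MARKERS`, and the sample rows are assumptions, and the real file may encode missing values differently.

```python
import csv
import io
from collections import Counter

# Assumption: these strings mark a missing value; the real file may use others.
MISSING_MARKERS = {"", "UNKNOWN", "(NULL)"}


def missing_share(csv_file, columns):
    """Return {column: fraction of rows whose value is empty or unknown}."""
    reader = csv.DictReader(csv_file)
    missing = Counter()
    total = 0
    for row in reader:
        total += 1
        for col in columns:
            # Normalize case and whitespace before comparing against the markers.
            value = (row.get(col) or "").strip().upper()
            if value in MISSING_MARKERS:
                missing[col] += 1
    return {col: (missing[col] / total if total else 0.0) for col in columns}


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = """SUSP_AGE_GROUP,SUSP_RACE,VIC_SEX
25-44,UNKNOWN,F
,BLACK,M
UNKNOWN,,F
25-44,WHITE,(null)
"""

shares = missing_share(io.StringIO(SAMPLE), ["SUSP_AGE_GROUP", "SUSP_RACE", "VIC_SEX"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

Reporting a fraction rather than a raw count makes columns comparable and helps decide whether a variable is complete enough to plot.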