Data integrity can mean the accuracy of the data: making sure that the data you have stored is correct.
Integrity can also mean the completeness of the data: if you are storing information, you are storing all of the information.
Data integrity can also refer to the method of data storage: making sure that wherever you keep the data, such as a hard drive, is working properly.
Data integrity can also be guidelines for data retention, or how long you keep data.
Examples of Data integrity
Complete Data
Tolkien instead of J. R. R. Tolkien
$1000.30 in your account vs $900ish dollars saved somewhere
Healthcare record says there are allergies but doesn't list them
You were awarded some amount of financial aid
Accurate Data
My dude Tolks instead of Tolkien
You're pretty sure you have $1000 in your account vs you KNOW you have $1034.56 in your account
Healthcare record says there is an allergy to pomegranate but it meant penicillin
You were awarded $20,000 in tuition vs $2,000 in tuition
Incorrect data vs badly organized data
Incorrect Data
Asked for a birthday and were told the colour blue
Wrong allergy written down in healthcare record
Wrong transcript was sent to transfer school
Incorrect amount was put on your credit card for your coffee
Badly Organized Data
Asked for a birthday and were told "dec" when the full answer was 12/2/2000
Data that doesn't follow one consistent format, such as sometimes the first name is first and sometimes the last name is first
Odd organizational methods, such as organizing your library database by word count or book weight
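The difference between a usable but badly organized answer and one that cannot be used as given can be sketched in a few lines of Python. This is only an illustration: the list of date formats, the label names, and the `classify_birthday` helper are all assumptions made for this example, and it treats 12/2/2000 as month/day/year.

```python
from datetime import datetime

# Assumed formats a collector might have typed; adjust for your own survey.
# The first one is the format we actually asked for.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]


def classify_birthday(answer):
    """Return (label, normalized_date) for one raw birthday answer."""
    text = answer.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # not this format, try the next one
        label = "ok" if fmt == KNOWN_FORMATS[0] else "badly organized"
        return label, parsed.date().isoformat()
    # Nothing matched: a program cannot tell a wrong answer ("blue")
    # from a partial one ("dec"), so a person has to follow up.
    return "needs follow-up", None


print(classify_birthday("2000-12-02"))
print(classify_birthday("12/2/2000"))
print(classify_birthday("blue"))
```

Badly organized answers can often be repaired by a program. Incorrect or partial answers usually mean going back to the person who gave them.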
Make sure you have a clear plan that everyone who is collecting data can see. It should identify what you need and how it's being measured, in clear and hard-to-misinterpret language
Decide if you are collecting qualitative (open-ended responses such as "I feel good today") or quantitative (numbers such as the year you were born) data
Have a clear system! Include procedures for, and tests of, the people doing the collecting BEFORE sending them out
Data cleaning is fixing data: removing duplicate data, correcting incorrect data, and fixing formatting
Data cleaning usually removes the data that doesn't belong (Important note! Keep a copy of the ORIGINAL data, just in case)
Quality and validity matter. Make sure your data makes sense
We clean data so that when we work with it later, we get better results.
For example, we might see both "N/A" and "non applicable", but they mean the same thing, so we could group them together
Cleaning can include removing outliers, but be warned! An outlier isn't the same as incorrect data, and your definition of "outlier" might not match someone else's. Don't ruin your data for a goal or personal opinion
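As a rough sketch of the two ideas above, the Python below groups different spellings of "not applicable" together and flags possible outliers for a person to review instead of deleting them. The sample values, the list of spellings, and the 1.5 x IQR rule are assumptions chosen for illustration. Other definitions of "outlier" are just as valid.

```python
import statistics

# Assumed spellings of "not applicable"; extend this set for your own data.
NA_VARIANTS = {"n/a", "na", "non applicable", "not applicable", ""}


def normalize_na(value):
    """Map every known spelling of 'not applicable' to None."""
    if value is None or value.strip().lower() in NA_VARIANTS:
        return None
    return value.strip()


def flag_outliers(numbers):
    """Return values more than 1.5 * IQR beyond the quartiles.

    Nothing is removed: the caller decides what to do with the flags.
    """
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [n for n in numbers if n < low or n > high]


raw_answers = ["N/A", "non applicable", "Penicillin", " n/a "]
cleaned_answers = [normalize_na(a) for a in raw_answers]  # raw_answers stays untouched

ages = [19, 21, 22, 20, 23, 21, 95]
to_review = flag_outliers(ages)  # a 95-year-old may be unusual but real

print(cleaned_answers)
print(to_review)
```

Because the original lists are never changed, you can always go back to them if a cleaning rule turns out to be wrong.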
Examples of how data is cleaned
We might clean data by only looking at specific age ranges or locations, as part of the process to make sure we're looking at the right group of people
We might clean data by removing duplicates, such as making sure everyone only did the survey once
We might clean data by fixing typos, naming conventions, or even just making sure the format is the same for all collected data
We might clean data by checking validity: does this data make sense? For example, we asked for a name and got a date, or we asked for a birthday and got a colour
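The cleaning steps listed above could look something like this in Python. It is a minimal sketch over a made-up survey: the field names, the sample rows, the 18 to 65 age range, and the `clean` function are all hypothetical.

```python
# Made-up survey rows for illustration only.
survey = [
    {"id": 1, "name": "Ana", "age": "34", "city": "austin"},
    {"id": 2, "name": "Ben", "age": "17", "city": "Austin "},
    {"id": 1, "name": "Ana", "age": "34", "city": "austin"},    # same person twice
    {"id": 3, "name": "Cleo", "age": "blue", "city": "AUSTIN"},  # age is not a number
    {"id": 4, "name": "Dev", "age": "52", "city": "Austin"},
]


def clean(records, min_age=18, max_age=65):
    """Return (kept, rejected) without changing the original records."""
    seen_ids = set()
    kept, rejected = [], []
    for original in records:
        row = dict(original)  # work on a copy, keep the ORIGINAL data
        row["city"] = row["city"].strip().title()  # same format for everyone
        if row["id"] in seen_ids:  # everyone only counts once
            rejected.append((row, "duplicate"))
            continue
        seen_ids.add(row["id"])
        if not row["age"].isdigit():  # validity: does this make sense?
            rejected.append((row, "age is not a number"))
            continue
        row["age"] = int(row["age"])
        if not min_age <= row["age"] <= max_age:  # right group of people
            rejected.append((row, "outside age range"))
            continue
        kept.append(row)
    return kept, rejected


kept, rejected = clean(survey)
for row in kept:
    print("kept:", row)
for row, reason in rejected:
    print("rejected:", row["id"], reason)
```

Recording a reason for every rejected row makes it possible for someone else to check, and disagree with, each cleaning decision.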
How AI is used to clean data
Artificial Intelligence is being used to clean data more and more
We can use programs to clean data, either the program we used to collect the data, or our own program written to look for specific things.
New advances in technology are having AI do this for us automatically! Not always correctly, but automatically! Examples include AWS SageMaker or AI validator for Google Sheets
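Writing our own program to look for specific things does not need AI at all. As a small sketch, the Python below checks a record for completeness, including the "allergies are flagged but not listed" problem from earlier. The field names, the sample records, and the `find_problems` helper are hypothetical and not taken from any real system.

```python
# Assumed list of fields every record must have; change to fit your data.
REQUIRED_FIELDS = ["name", "birthday", "has_allergies"]


def find_problems(record):
    """Return a list of human-readable completeness problems for one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # A record that says there are allergies must also list them.
    if record.get("has_allergies") and not record.get("allergies"):
        problems.append("allergies flagged but none listed")
    return problems


complete = {"name": "Sample Patient", "birthday": "2000-12-02",
            "has_allergies": True, "allergies": ["penicillin"]}
incomplete = {"name": "", "birthday": "2000-12-02",
              "has_allergies": True, "allergies": []}

print(find_problems(complete))
print(find_problems(incomplete))
```

A check like this only finds what it was told to look for, so a person still needs to decide which rules matter.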
Why data is important to every job
Good data can lead to better choices; bad data can lead to worse ones
All businesses run on data to some extent, even if it's just sales and goods.
Some businesses run on data more than others, such as social media and other tech-focused companies
Data can help make a business more efficient, improve quality, and either make more money or help more people depending on the type of company
Examples of how data is used well in jobs
Predicting future sales and stocking of goods
Predicting market trends so that the right amount of goods can be produced or ordered on time
Healthcare uses data to see failures in the system and fix them to help save more people
Data can be used to see trends and either continue or correct them
Examples of how data is used badly
Laying off people because the data says they don't make money, except those were the people making and checking the goods produced, so the company can't make money in 6 months
Mars Climate Orbiter was lost because one piece of software produced data in US customary units while another expected metric units, and the data was never converted
2008 housing crash involved bad data saying pieces of the market were worth more than they actually were, and pieces of the economy collapsed when the true values were found out
In small groups or pairs, consider the following questions:
How can data-driven decisions lead to better outcomes in various fields, such as healthcare or education?
What are some potential risks and challenges associated with the misuse or mishandling of data?
How does data impact privacy, ethics, and security concerns?
How do Artificial Intelligence and Machine Learning affect the topic of your choice?
Include at least 1 article on the topic of data in modern life. In your discussion posting you should link your article, give a quick summary (5 sentences or fewer), and include your opinion on the quality of the article using the CRAAP method as described HERE
Activity: Download this PDF and the sample data from here and follow the instructions from the PDF. To do the second half of this activity, you need the data you collected from this activity. More sources of data can also be found at the resources page for this course.