Big Data

Learning outcomes:

Big Data refers to extraordinarily large collections of data
This data can be structured, unstructured, or semi-structured
These data stores or datasets are so large that traditional tools can't handle them, let alone use them in any reasonable way
Big Data is usually not only extremely large, but also growing very quickly
Big Data is commonly used for things like predictive modeling, AI, and Machine Learning

Big Data means you can (technically) make better choices, because more data points make patterns easier to find
Real-time data collection and analytics let you react and make choices faster
Having more data, and having it in real time, can let you automate more of your business and reduce costs
The more data you have, the more closely you can tailor your business to your customers

The defining characteristics of Big Data are often called the "3 Vs of big data", a framing introduced in 2001 by analyst Doug Laney at META Group (which Gartner later acquired)
Volume - High volume of data (Much data)
Velocity - Speed at which data is generated, which tends to be real time (So fast)
Variety - Many sources and formats of data (Such variety)

Advertisements for products and marketing campaigns
Route-planning software such as Google Maps or Waze, which uses GPS and real-time traffic data to help plan your routes
Fraud detection for banks and credit cards
Predicting weather patterns, natural disasters, and climate change, and powering early warning systems

The ability to handle large volumes of data
The ability to handle real-time visualizations, ideally interactive ones
Large amounts of data storage, which means places to keep the data, backups, and people to look after the data, the backups, and the security of the system

Apache Hadoop - Stores and processes data, very popular, open source, uses distributed computing to spread the work across many machines
Apache Spark - Unified analytics engine, very popular, open source, uses cluster computing to process the data, with lots of built-in options for working with your data such as ML and SQL, plus APIs for languages like Python, R, and Java
Splunk - Data analytics tool that can handle big data, supports dashboards and data visualizations, and also incorporates AI
Tableau - Data visualization tool, very popular in companies, can create lots of different types of charts and has a drag-and-drop interface so it's easier to learn, commonly seen as a dashboard
This is not an exhaustive list, just some examples and popular options
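
To make "distributed computing" a little more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that this style of processing is built around. It is plain Python running on one machine, not the Hadoop or Spark API. The function names and the sample text are assumptions made up for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Group the emitted counts by word across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input: each string stands in for a block of data
# stored on a different machine in a cluster.
chunks = ["big data is big", "data moves fast", "fast data is big data"]

# On a real cluster the map step would run on many machines in parallel;
# here it is an ordinary loop.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

The point of the pattern is that each chunk can be mapped without knowing anything about the other chunks, which is what lets the work be split across machines.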

NoSQL databases are said to handle bigger data pools by default, largely because many of them are designed to spread data across multiple machines
Big data can be stored in places besides conventional databases, such as a data lake (raw data), a data warehouse (processed data), or a data mart (a data warehouse for a specific purpose)
Data warehouses (or marts) can be databases because the data has already been processed; they need a schema design so the data can be easily worked with
Data lakes can hold anything from web logs and social media posts to sensor data collected in real time from IoT (Internet of Things) devices around the world
What's the Difference Between a Data Warehouse, Data Lake, and Data Mart? by AWS
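
The difference between raw data in a lake and processed data in a warehouse can be sketched in a few lines. This is only an illustration under stated assumptions: the sample events, the `readings` table, and its columns are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# Data lake side: raw events kept exactly as they arrived
# (hypothetical sample records, including messy ones).
raw_events = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": "n/a", "ts": "2024-01-01T00:00:05"}',
    '{"user": "alice", "action": "login"}',
]

# Data warehouse side: a fixed schema that every row must fit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device TEXT NOT NULL, temp_c REAL NOT NULL, ts TEXT NOT NULL)"
)

loaded, rejected = 0, 0
for line in raw_events:
    event = json.loads(line)
    try:
        # Processing step: keep only events that match the schema.
        row = (event["device"], float(event["temp_c"]), event["ts"])
    except (KeyError, ValueError):
        rejected += 1
        continue
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", row)
    loaded += 1

avg_temp = conn.execute("SELECT AVG(temp_c) FROM readings").fetchone()[0]
print(loaded, rejected, avg_temp)
```

The lake keeps all three events, while the warehouse only accepts the one that fits its schema. That is the tradeoff: the lake is flexible but messy, the warehouse is clean but selective.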

Database scalability is how gracefully you can work with a small amount of data and grow it into a large amount of data, for example going from 100 customers to 1,000 customers to 100,000 customers
One example is the choice between a sequential ID and a GUID as a key: GUIDs can be generated independently, which helps when multiple people are manipulating the data at once
At large scale, tradeoffs need to be made to ensure the database works correctly; a traditional relational database on a single server tends to struggle if you try to use it for big data
NoSQL is one way to make some of these tradeoffs that traditional relational databases are not designed to make
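
The ID vs GUID point can be sketched as follows. This is a simplified illustration, not how any particular database assigns keys: the `next_sequential_id` helper and the two "writers" are hypothetical, and a real single-server database would normally avoid the collision with locks or sequences.

```python
import uuid

def next_sequential_id(visible_ids):
    """Pick max(id) + 1 from the rows this writer can currently see."""
    return max(visible_ids, default=0) + 1

# Two hypothetical writers start from the same snapshot of the table.
snapshot = [1, 2, 3]
writer_a_id = next_sequential_id(snapshot)
writer_b_id = next_sequential_id(snapshot)
sequential_collision = writer_a_id == writer_b_id  # both writers chose the same ID

# With GUIDs (UUIDs in Python), each writer generates its own key
# without asking anyone else what has already been used.
writer_a_guid = uuid.uuid4()
writer_b_guid = uuid.uuid4()
guid_collision = writer_a_guid == writer_b_guid

print(writer_a_id, writer_b_id, sequential_collision)
print(guid_collision)
```

Sequential IDs need a single point of coordination to stay unique. GUIDs trade a larger, less readable key for the ability to create records anywhere at once.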

Lack of talent/skills - People with the experience and skills needed to work well with big data tools are in short supply, so they can be hard to find and very expensive to hire
Scalability - Infrastructure weaknesses and tech debt will hit FAST if you aren't careful. Big data comes in fast and needs to be worked with fast, and not all systems and networks can handle it
Quality - Not all data is good data. We can collect data we shouldn't, collect data that isn't helpful, or collect data that is hard to organize
Security/Compliance - The more data you have, the bigger a target you are for hackers and the more valuable you are to bad actors

Streaming data in real time: the data is processed as a stream instead of in batches, so you get more up-to-date information and analytics
Artificial Intelligence (AI) and Machine Learning (ML) for more automated decision making and responses
Democratization of data: people having more access to their own data, including the ability to remove it, download it, and see what is "known" about them
More no-code and low-code solutions: if AI and ML can help out, it might be possible to have tools that more people can use more easily, with less background knowledge required
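
The difference between batch and stream processing, mentioned in the first item above, can be sketched with a running average. The readings and function names are made up for illustration, and a real streaming system would read from a message queue or socket rather than a Python list.

```python
def batch_average(values):
    """Batch style: wait for the full dataset, then compute once."""
    return sum(values) / len(values)

def streaming_average(stream):
    """Stream style: update the answer as each value arrives."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 20.0, 30.0, 40.0]

running = list(streaming_average(iter(readings)))
print(running)                  # an answer is available after every reading
print(batch_average(readings))  # a single answer, only once all data is in
```

Both approaches end at the same number, but the streaming version has a usable answer after every reading and never needs the whole dataset in memory.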

Suggested Activities and Discussion Topics:

Think of a specific scenario where a large volume of data is generated (such as online bank transactions, Wikipedia edits, sports stats, or sensor data). Estimate the scale of this data in terms of size (gigabytes or terabytes? Larger?). What challenges might arise when dealing with such massive amounts of data?
Can you think of any recent controversies or ethical dilemmas related to big data usage for the topic you picked above?
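
If you are unsure how to start the size estimate, a back-of-the-envelope calculation like the one below is usually enough. Every number here is an invented assumption for illustration, not a real statistic; swap in figures you can justify for your own scenario.

```python
# Made-up assumptions for a hypothetical bank (not real figures).
transactions_per_day = 50_000_000
bytes_per_transaction = 500

bytes_per_day = transactions_per_day * bytes_per_transaction
gb_per_day = bytes_per_day / 1e9           # 1 GB taken as 10**9 bytes
tb_per_year = bytes_per_day * 365 / 1e12   # 1 TB taken as 10**12 bytes

print(f"{gb_per_day:.1f} GB per day")
print(f"{tb_per_year:.3f} TB per year")
```

Under these assumptions the raw records come to about 25 GB per day and a little over 9 TB per year, before counting backups, indexes, or logs.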

Share an article related to a topic of your choice that generates a lot of data, and explain why you find it interesting and why you picked it. Some example topics are online games or some variety of sports or gaming statistics
Complete this PDF
Activity: Listen to this podcast that was created using AI from these materials (a transcript for the podcast is also available). What are your thoughts? Did the AI do a good job representing the materials? Did you find any mistakes?
Go through this AI-generated study guide. What do you think? Did it capture the week's materials well? How did you do on the self quiz? Do you know all the vocab used?
