Advertisements for products and marketing campaigns
Route-planning software such as Google Maps or Waze, which use GPS and real-time traffic data to help plan your routes
Fraud detection for banks and credit cards
Predicting weather patterns, natural disasters, and climate change, and powering early warning systems
Why Big Data needs different tools/languages
The ability to handle large volumes of data
Support for real-time visualizations, ideally interactive ones
Large amounts of storage are needed, which means places to store the data, backups, and people to look after the data, the backups, and the security of the system
What tools do we use with Big Data
Apache Hadoop - Stores and processes data, very popular, open source, uses distributed computing to process the data
Apache Spark - Unified analytics engine, very popular, open source, uses cluster computing to process the data, lots of built-in options for working with your data such as ML and SQL, plus APIs for languages like Python, R, and Java
Splunk - Data analytics tool that can handle big data, can work with dashboards and data visualizations, also incorporates AI
Tableau - Data visualization tool, very popular in companies, can create lots of different types of charts and has a drag-and-drop interface so it's easier to learn, commonly used to build dashboards
This is not an exhaustive list, just some examples and popular options
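To make the "distributed computing" idea behind tools like Hadoop and Spark more concrete, here is a minimal sketch in plain Python of the split-then-combine pattern. It does not use Hadoop or Spark themselves; the function names (`map_chunk`, `merge_counts`, `word_count`) and the sample text are made up for illustration, and everything runs on one machine.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count the words inside one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(chunks):
    # On a real cluster each chunk would be handled by a different machine;
    # here we simply loop over the chunks one after another.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data, already split into chunks
chunks = [
    ["big data needs big tools", "data lakes hold raw data"],
    ["spark and hadoop process big data"],
]
totals = word_count(chunks)
print(totals.most_common(2))
```

The point of the pattern is that each chunk can be counted independently, so adding more machines lets you handle more chunks at the same time.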
How Big Data is different than databases
NoSQL databases are generally designed to spread data across many machines, so they are often said to handle bigger data pools by default
Big data can be stored in places besides databases, such as a data lake (raw data) instead of a data warehouse (processed data) or a data mart (a data warehouse for a specific purpose)
Data warehouses (or marts) can be databases because the data has been processed; they need a schema design so the data can be easily worked with
Data lakes can hold anything from web logs and social media posts to sensor data collected in real time from IoT (Internet of Things) devices around the world
Database scalability is how gracefully you can start with a small amount of data and grow it into a large amount, for example going from the data of 100 customers, to 1,000 customers, to 100,000 customers
An example of this is the choice between a sequential ID and a GUID (globally unique identifier): a GUID can be generated by any user or server without checking a shared counter, which makes it easier for multiple people to add and change data at once
At large scale, tradeoffs need to be made to ensure the database keeps working correctly; a traditional relational database can struggle badly if you try to use it for big data without making them
NoSQL is one way of making some of these tradeoffs that traditional relational databases are not designed to make
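The ID vs GUID point above can be sketched in a few lines of Python. This is one interpretation of that tradeoff, not a description of any particular database: `SequentialIdSource` and `new_guid` are hypothetical names, and the two "writers" stand in for two servers that cannot see each other's counters.

```python
import itertools
import uuid


class SequentialIdSource:
    """Hands out 1, 2, 3, ... which only works if every writer shares it."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_id(self):
        return next(self._counter)


def new_guid():
    """Any writer can call this on its own, with no shared state."""
    return str(uuid.uuid4())


# Two writers that each keep their own counter collide right away
writer_a = SequentialIdSource()
writer_b = SequentialIdSource()
ids_a = [writer_a.next_id() for _ in range(3)]
ids_b = [writer_b.next_id() for _ in range(3)]
colliding_ids = set(ids_a) & set(ids_b)

# Two writers generating GUIDs independently do not collide in practice
guids_a = [new_guid() for _ in range(3)]
guids_b = [new_guid() for _ in range(3)]
colliding_guids = set(guids_a) & set(guids_b)

print("Sequential ID collisions:", sorted(colliding_ids))
print("GUID collisions:", len(colliding_guids))
```

The tradeoff is that GUIDs are longer and not human-friendly, while sequential IDs are compact but need a single place that decides the next number.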
Some problems with Big Data
Lack of talent/skills - People with the experience and skills to work well with big data tools are in short supply, so they can be hard to find and very expensive to hire
Scalability - Infrastructure weaknesses and tech debt will hit FAST if you aren't careful. Big data comes in fast and needs to be worked with fast, and not all systems and networks can handle it
Quality - Not all data is good data. We can collect data we shouldn't, collect data that isn't helpful, and end up with data that is hard to organize
Security/Compliance - The more data you have, the bigger the target for hackers and the more valuable you are to bad actors
Where Big Data might be going next
Streaming data in real time, where data is processed as it arrives instead of in batches, so information and analytics are more up to date
Artificial Intelligence (AI) and Machine Learning (ML) for more automated decision making and responses
Democratization of data, with people having more access to their own data, including the ability to remove it, download it, and see what is "known" about them
More no-code and low-code solutions. If AI and ML can be used to help out, it might be possible to have tools that more people can use more easily, with less background knowledge required
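The difference between batch and stream processing mentioned in the list above can be illustrated with a small Python sketch. The sensor readings are made-up values and the function names are hypothetical; real streaming systems are far more involved, but the core idea is the same: a streaming calculation gives an updated answer after every new piece of data instead of waiting for all of it.

```python
def batch_average(readings):
    """Batch: wait until all the data has arrived, then compute once."""
    return sum(readings) / len(readings)


def streaming_averages(readings):
    """Stream: update the answer as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical temperature readings from a sensor
sensor_readings = [20.0, 22.0, 21.0, 25.0]

running = list(streaming_averages(iter(sensor_readings)))
final_batch = batch_average(sensor_readings)

print("Running averages:", running)
print("Batch average:", final_batch)
```

Both approaches end at the same final answer, but the streaming version has a usable answer available at every step along the way.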
Suggested Activities and Discussion Topics:
In groups or pairs, discuss the following questions
Think of a specific scenario where a large volume of data is generated (such as online bank transactions, Wikipedia edits, sports stats, or sensor data). Estimate the scale of this data in terms of size (gigabytes, terabytes, or larger?). What challenges might arise when dealing with such massive amounts of data?
Can you think of any recent controversies or ethical dilemmas related to big data usage for the topic you picked above?
Share an article related to a topic of your choice that generates a lot of data, and explain why you picked it and why you find it interesting. Some example topics might be online games, sports statistics, or gaming statistics