Understand in one article: how to choose between offline data and real-time data

One thing that people who produce and use data cannot avoid is the timeliness of data. What do offline data and real-time data refer to, and what criteria should decide which one a business application uses? Many product and operations staff on the business side do not understand the difference between the two. They submit data analysis requests on the assumption that the more real-time the better, so how can the data team push back?

What are offline data and real-time data?

From the moment data is generated on the business side until it is analyzed or fed back into the business, it passes through a series of cleaning and processing steps. These steps introduce a time window, and the size of that window is the timeliness of the data. According to the size of the delay, data can be divided into offline data and real-time data (including quasi-real-time data).

1. Offline data

Offline data generally refers to T-1 data. For example, if today's date is T = 2021-11-12, the results can only reflect business data up to the previous day, 2021-11-11. Some people call this T+1 data instead, treating the data date as T and the processing date as T+1. The names differ, but the essence is the same: the latest data in the results processed today is as of yesterday.

2. Real-time data

Real-time data mainly refers to data with low delay, at the millisecond, second, or minute level. Data with hour-level delay is more accurately called "quasi-real-time data". For example, you stay up late on Double 11, pay your balance in the final minute, and the GMV figure on the Double 11 real-time statistics screen ticks up again.
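The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the exact cut-offs (one hour and one day) are assumptions made for the example, not standards from the article, which only distinguishes millisecond/second/minute delay, hour-level delay, and T-1 data.

```python
from datetime import date, timedelta

# Hypothetical cut-offs, chosen only for illustration.
REAL_TIME_MAX = timedelta(hours=1)       # millisecond-, second-, or minute-level delay
QUASI_REAL_TIME_MAX = timedelta(days=1)  # hour-level delay


def latest_offline_date(run_date: date) -> date:
    """Latest business date covered by an offline (T-1) result produced on run_date."""
    return run_date - timedelta(days=1)


def classify_timeliness(delay: timedelta) -> str:
    """Label a data delay using the article's three categories."""
    if delay < timedelta(0):
        raise ValueError("delay cannot be negative")
    if delay < REAL_TIME_MAX:
        return "real-time"
    if delay < QUASI_REAL_TIME_MAX:
        return "quasi-real-time"
    return "offline"


print(latest_offline_date(date(2021, 11, 12)))     # 2021-11-11
print(classify_timeliness(timedelta(seconds=3)))   # real-time
print(classify_timeliness(timedelta(hours=2)))     # quasi-real-time
```

In practice the boundary between the categories is set by each team, so the constants would be adjusted to whatever delay the business actually agrees on.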

What are the differences in processing technology?

1. Offline data processing

Offline data processing is also called "batch processing". Data is not cleaned as soon as it is generated. Instead, ETL runs on a fixed schedule: for example, shortly after midnight (12:00 am) each day, the data generated the previous day is processed. When I was in college, some of my roommates liked to save up their socks and wash them all once a week. That is the idea of batch processing.

Offline processing was one of the earliest technology stacks to develop in big data and is now very mature. The most common framework is Hadoop, a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable manner.
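The batch idea can be sketched without any big data framework. The example below is plain Python, not Hadoop code. The event layout `(timestamp, user, amount)`, the cleaning rule, and the function name `run_daily_batch` are all hypothetical, chosen only to show a job that runs after midnight and processes the previous day's accumulated data in one pass.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def run_daily_batch(events, run_date):
    """Aggregate order amounts per user for the day before run_date (T-1)."""
    target_day = run_date - timedelta(days=1)
    totals = defaultdict(float)
    for timestamp, user, amount in events:
        if timestamp.date() != target_day:
            continue  # outside yesterday's window, left for another run
        if amount is None or amount < 0:
            continue  # cleaning step: drop invalid records
        totals[user] += amount
    return dict(totals)


# Events accumulate during the day and are only processed after midnight.
events = [
    (datetime(2021, 11, 11, 23, 59), "alice", 100.0),
    (datetime(2021, 11, 11, 10, 0), "bob", 20.0),
    (datetime(2021, 11, 11, 12, 0), "alice", None),  # dirty record
    (datetime(2021, 11, 12, 0, 5), "alice", 50.0),   # belongs to the next batch
]

result = run_daily_batch(events, date(2021, 11, 12))
print(result)  # {'alice': 100.0, 'bob': 20.0}
```

The order placed at 00:05 on 2021-11-12 does not appear until the job runs again the following day, which is exactly the T-1 delay described earlier.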