At the Global Cloud-native Opensource Summit 2022 (GCOS), Apache Software Foundation Member, Apache Incubator PMC Member, and Apache DolphinScheduler PMC member Guo Wei (William) shared his thoughts on how Apache data engineering projects can improve your DataOps efficiency. As an open-source evangelist, William also founded the ClickHouse China User Group to promote open-source culture in China.
Apache Software Foundation Member
Apache Incubator PMC Member
Apache DolphinScheduler PMC
Founder of ClickHouse China User Group
Vice chairman of the intelligent application service branch of the China Software Industry Association
Top 33 Open Source People in China (from SegmentFault)
TGO Board Member (InfoQ Global CTO community)
Here is the summary of his speech at GCOS 2022 for your reference.
Watch the video: https://www.youtube.com/watch?v=tZqCyk9hJxk
Today, we find that there are so many new data technologies that it’s hard for us to know them all. So what has happened in the data field? And how can we deal with so many technologies? Several new concepts have emerged to help handle this kind of issue.
One of them is called the Data Fabric, a concept coined by IBM and later popularized by Gartner. The key idea of data fabric, in my view, is to organize all of your data through active, intelligent metadata. You can then build your own intelligent data knowledge graph and use machine learning algorithms to drive your data governance and data marketplace.
Underlying all of these processes is DataOps, which I will introduce later.
Another concept is Data Mesh, which aims to build your data products through APIs that focus on people and processes. When you use Data Mesh, you don’t need to know where the data is stored.
You just create your federated governance APIs and deliver domain-based data APIs to decentralize your data ownership.
Modern Data Stack
I notice that many of us are using a Modern Data Stack. As the picture below shows, in the old days we usually used ETL: when processing multiple data sources, we extracted the data, transformed it, and then loaded it into the warehouse.
Things are different now. Nowadays we just extract the data and load it into the lake, and then transform it inside the data lake or data warehouse. During that transformation we can control data quality and build our business intelligence, as well as support data scientists and machine learning on top of it.
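The ELT pattern above can be sketched with a toy example: raw rows are extracted and loaded into the warehouse unchanged, and the transformation happens afterwards inside the warehouse engine itself. Here sqlite3 stands in for the warehouse, and all table and column names are illustrative, not from any real system.

```python
import sqlite3

# sqlite3 stands in for the data warehouse in this toy ELT sketch.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw events land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
raw_rows = [("u1", "10.5"), ("u2", "3.0"), ("u1", "7.5")]  # e.g. from an API dump
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", raw_rows)

# Transform: cleaning and aggregation run inside the warehouse, after loading.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
""")

print(dict(conn.execute("SELECT user_id, total FROM user_totals")))
# {'u1': 18.0, 'u2': 3.0}
```

The point of the pattern is that the raw data stays queryable in the warehouse, so the transform step can be re-run or changed later without re-extracting from the sources.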
So what is DataOps? In my opinion, DataOps is a key technology for connecting data sources to data targets. It covers data orchestration, data integration, data transformation, and data governance.
This is the typical layout of a big data enterprise. From your data sources, you can load your data into big data storage products like Kafka or Pulsar. You then use a streaming engine such as Storm, Flink, Spark Streaming, AWS Kinesis, or Tencent Oceanus to load the data into OLAP engines such as Crystal, ClickHouse, Druid, Impala, etc. On top of those you can use your BI tools, and later load the data into a data scientist platform.
Today I will introduce two Apache projects on DataOps. One is called Apache DolphinScheduler, which is for data orchestration, and the other one is Apache SeaTunnel, which is aimed to extract and synchronize data from one database to another database.
1 Apache DolphinScheduler
First, I’d like to introduce Apache DolphinScheduler. It is a big data workflow orchestration platform with a powerful DAG interface that runs on K8s or on-premises machines. It’s dedicated to solving complex task dependencies in the data pipeline.
It’s very easy to use: you just drag and drop to create a new workflow and maintain your running jobs without coding.
Designed with a multi-master and multi-worker architecture, it is very easy to scale and extremely stable. Its cloud-native features support multi-cloud, hybrid cloud, and K8s. Python-to-DAG and MLOps orchestration are also available; I will introduce these parts later.
Many features of Apache DolphinScheduler make it easy to build task workflows: visual DAG jobs support simple operation and let you view your task status in real time; it’s cloud-native; it supports many task types and abundant dependency types; there are various log and alert mechanisms to choose from; and the backfill (complement) feature allows you to refresh historical data.
Workflow Management: Visualized Drag-and-Drop Workflow Configuration
This is the interface of Apache DolphinScheduler. As shown in the picture, you can drag and drop to create your own DAG graph. Perhaps you have Shell, Spark, or EMR programs with dependencies between them. You can just drag and connect them on this DAG graph, and then they will run on K8s or on your machines automatically.
Now Apache DolphinScheduler supports various task types like Shell, MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Flink, EMR, etc. It is also equipped with logical tasks such as the sub-process task, which means you can bring Shell, EMR, and Spark programs together into one DAG, and that DAG can then become a sub-process of a larger DAG.
It supports dependent tasks, which means one workflow can depend on the tasks of other workflows, along with condition and switch tasks. If you want to branch to a different program, it’s very easy to build your data process with these task types on the graph. In a word, it’s very easy to create and design the whole workflow.
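The dependency and switch behavior described above can be modeled with a minimal sketch: tasks declare which tasks they depend on, the scheduler runs them in topological order, and a switch-style condition decides which branch is skipped. The task names and the tiny scheduler are illustrative only, not DolphinScheduler's actual implementation.

```python
from graphlib import TopologicalSorter

# deps: task -> set of tasks it depends on (all names are illustrative).
deps = {
    "extract":   set(),
    "transform": {"extract"},
    "switch":    {"transform"},
    "load_hot":  {"switch"},   # branch taken when the switch condition holds
    "load_cold": {"switch"},   # branch taken otherwise
}

def run_workflow(condition_hot: bool) -> list:
    """Run tasks in dependency order, skipping the branch the switch rejects."""
    skipped = {"load_cold"} if condition_hot else {"load_hot"}
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in skipped:
            continue
        executed.append(task)  # a real scheduler would submit the task here
    return executed

print(run_workflow(condition_hot=True))
# ['extract', 'transform', 'switch', 'load_hot']
```

A real scheduler adds retries, timeouts, and cross-workflow dependencies on top of this ordering, but the core idea is the same: never start a task before everything it depends on has finished.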
Visualization of Running Workflow
Supports Rerun and Retry Tasks and Inspect Task
Monitoring the workflow running status is very easy. In the graph below, you can see the details of what happened in your workflow, including success or failure status, schedule time, start time, finish time, logs, etc. When checking the logs, you only need to log on to Apache DolphinScheduler rather than the Spark or EMR server.
Task Management: Three Levels of Monitoring Logs
Task status data statistics
Process instance status
Tracking of task execution status
Task execution log online
1. MLOps Orchestration
Above are the basic functions of Apache DolphinScheduler. We have added some new features to meet user needs, one of them being MLOps orchestration.
There are so many machine learning packages in our programs, like PyTorch, SageMaker, MLDB, MLflow, and Jupyter. When programming your machine learning pipeline, you will find it’s very hard to connect them. For example, if I want to create my data preparation process, I put the data into Jupyter, and I need SageMaker later. How can I connect them? I can use Apache DolphinScheduler to create a process that connects everything. With Apache DolphinScheduler, you can prepare data with Spark, pass the data to Jupyter, train your model online with SageMaker, and, when verification fails, retry and go back to data preparation. You can maintain your tasks in one place with Apache DolphinScheduler, without writing glue code between different platforms.
Here is an example of MLOps by DolphinScheduler. Just like this picture shows, we build up a stock selection system by using DolphinScheduler to select the top 10 stocks from more than 4,800 stocks in the Chinese stock market by machine learning every day, and monitor the real-time stock selection effect through Observable.
You can drag and drop to prepare your data, train the data on MLflow to deploy your model, evaluate the data by Shell, and then you can get your training result.
It allows data analysts and scientists to easily build and reuse analysis processes.
I notice that some developers want to create a DAG by Python, not only by dragging and dropping to create a process.
So we developed a new feature called PyDolphinScheduler. That means you can use a Python program (left) to create the DAG graph shown on the right.
Then you can monitor the process and run the task on the DAG. It’s easy to do version control, code review, CI/CD, and other operations.
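A minimal sketch of the idea behind this code-to-DAG style: tasks are plain Python objects, and the `>>` operator records "runs after" edges. This toy `Task` class only mimics the operator pattern; the real PyDolphinScheduler API has its own classes and parameters, so treat all names below as illustrative.

```python
class Task:
    """Toy task that records downstream dependencies via the >> operator."""
    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # "a >> b" means b runs after a.
        self.downstream.append(other)
        return other  # returning `other` allows chaining: a >> b >> c

def edges(root):
    """Collect (upstream, downstream) edges of the DAG reachable from root."""
    result, stack = set(), [root]
    while stack:
        task = stack.pop()
        for child in task.downstream:
            if (task.name, child.name) not in result:
                result.add((task.name, child.name))
                stack.append(child)
    return result

prepare = Task("prepare_data")
train = Task("train_model")
deploy = Task("deploy_model")
prepare >> train >> deploy  # chained dependency declaration

print(sorted(edges(prepare)))
# [('prepare_data', 'train_model'), ('train_model', 'deploy_model')]
```

Because the DAG is just code, it can live in a Git repository, which is what makes version control, code review, and CI/CD straightforward.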
The new cloud-native architecture is a new feature of Apache DolphinScheduler: it supports K8s, benefiting from the multi-master and multi-worker design.
Recently, we released a K8s operator to optimize workflow running on K8s, making it easier to organize your whole data process by DolphinScheduler.
Apache DolphinScheduler has been used in the production environment of many enterprises, such as IBM, Tencent, Walmart, and McDonald’s.
Apache DolphinScheduler community is growing fast, with more than 370 contributors. You’re very welcome to join the community.
To learn more about this smart, easy-to-use job scheduler, you can refer to the Apache website https://dolphinscheduler.apache.org/, or join the Slack channel https://join.slack.com/t/asf-dolphinscheduler/shared_invite/zt-1cmrxsio1-nJHxRJa44jfkrNL_Nsy9Qg.
2 Apache SeaTunnel
The other Apache data engineering project is Apache SeaTunnel, which is now incubating at the ASF.
We often face the problem of how to extract data from one data source and move it to our target data sources.
In the old days, we might have used Apache Sqoop to solve this issue, but it was retired last year. For the new world, the cloud-native data synchronization project SeaTunnel is a new choice.
It is a distributed, high-performance data integration platform for the synchronization and transformation of massive data. You can load your data from one database and sync it to other databases.
This is not as easy as it sounds. For example, if you are using ClickHouse, you will find that while queries are very fast, writing data in is not: if you insert a bunch of data into ClickHouse in many small batches, it will throw a lot of errors.
SeaTunnel solves this problem by creating a ClickHouse data file and copying that file to the ClickHouse server.
The performance of this method is 10 times faster than inserting the data into ClickHouse directly. SeaTunnel now supports abundant databases for synchronizing and integrating data into other data targets. Although you can sync and compute data with Spark or Flink, you would have to write your own connectors between databases; SeaTunnel supports over 40 connectors now.
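A SeaTunnel job is described declaratively in a config file with `env`, `source`, optional `transform`, and `sink` sections, so syncing between two systems is mostly a matter of picking connectors. The sketch below shows the general shape of such a job; the exact option names vary by connector and SeaTunnel version, so treat every key, host, and credential here as an illustrative placeholder and check the connector documentation.

```
env {
  # Illustrative: batch synchronization job.
  job.mode = "BATCH"
}

source {
  # Illustrative JDBC source; option names depend on the connector version.
  Jdbc {
    url = "jdbc:mysql://mysql-host:3306/orders"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "reader"
    password = "changeme"
    query = "SELECT id, amount, created_at FROM orders"
  }
}

sink {
  # Illustrative ClickHouse sink target.
  Clickhouse {
    host = "clickhouse-host:8123"
    database = "analytics"
    table = "orders"
  }
}
```

Swapping either end of the pipeline for another of the 40+ connectors means editing this file, not writing connector code.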
SeaTunnel is easy to use, saving you from learning Sqoop or other tools.
SeaTunnel is favored by many users, too, such as Tencent, OPPO, Shopee, etc.
There are mainly two scenarios where SeaTunnel is used. One is massive data synchronization. For example, VIP.com has many data sources, such as Kudu, Hudi, ClickHouse, Presto, and Kylin, as well as its own internal data analysis platform, and SeaTunnel is used to sync these data sources.
It’s very easy to deal with multiple databases or data sources by SeaTunnel.
If you are interested in this high-performance data sync platform, you’re welcome to join the Slack channel or visit the Apache website SeaTunnel.Apache.org.
New Data Community, New Data World!
Today, we find that more and more people beyond data engineers, data scientists, data architects, and ELT developers are using data. Every company is creating its own data community; even sales analysts, customer support, board members, and financial analysts are using data broadly. I think DataOps will make it possible for everyone to use data more easily in the future. Finally, I believe we will create a new data community and a new data world.
Linkedin ID: WilliamK2000