Data Mesh offers an alternative to the centralized organizational and architectural model of the data lake: a distributed, decentralized architecture designed to give companies agility and scalability.
Despite the countless advantages it offers, we often feel that technology, instead of increasing productivity, makes our work more difficult and complex.
This happens when it is poorly implemented, with overly strict rules or a logic of operation that does not match the user's needs or, again, when it seems to interfere for no reason (just try to move a picture by a millimetre in Word to see what happens).
This does not mean that we regret the days when we had to keep a paper map on our laps while driving, before the advent of satellite navigation, or when we had to browse the Yellow Pages looking for a hotel, a restaurant or even just a phone number.
In short, we need technologies that really help us and that are pleasant to work with. Data Mesh, like Mr. Wolf in the film Pulp Fiction, solves problems effectively, quickly and without frills. Let's see what that means in practice.
Data Mesh in summary

Data Mesh is a relatively new concept that has become one of the fastest-growing trends of the past year. Agile Lab, which proposes it as a solution, describes it as the extension of the paradigm shift introduced by microservice architectures, applied to data architectures, enabling agile and scalable analytics, machine learning and artificial intelligence. Data Mesh offers an alternative to the centralized organizational and architectural model of the data lake: a distributed, decentralized architecture designed to give companies agility and scalability, reduce the time-to-market of business initiatives and lower maintenance costs.
Data Mesh and real problems
Today the challenge in a modern company is to find reliable data. It is not just about knowing where the data is, but whether we can really trust it. In an increasingly competitive business world, avoiding extra costs and not missing an opportunity is an absolute must. And you cannot afford missteps simply because you cannot find, or cannot trust, your data. Today it is highly unlikely that a company lacks the data it needs for a report, a new KPI, a new Business Intelligence or Business Analytics initiative, an analysis to validate a new business proposition, and so on.
On the contrary, we seem to be overwhelmed by data and constantly struggling to manage it, with continuous questions about its quality. Is it reliable? Who created it? Does it come from within the company or was it purchased? Was it derived from another data set? Is it up to date? Can I find all the information I need in one place? Is it in a format compatible with my needs? What is the complexity (and therefore the cost) of extracting the information I need from that data? Has anyone else in the company already used that data set and, if so, what was their experience?
There may therefore be problems to resolve before a given data asset can be used. No wonder: every new business analysis, every new BI initiative has its own set of obstacles. The real challenge, however, is the ability to estimate these efforts in advance, so as not to incur budget overruns or other costly delays.
It seems that whenever you try to analyse the situation you can never get a definitive answer from the data engineers, either about the quality of the data or about how long it will take to determine whether the data are adequate. There is no way to set a budget (in terms of both time and resources) to solve problems when you cannot know in advance what problems the data might have and how expensive it would be to find out.
It is impossible to estimate a time-to-market if you do not know the challenges ahead. So how can you determine whether a product or service will be relevant on the market when there is no way of knowing how long it will take to launch it? Is it a risk worth taking, or should the whole project be blocked? This is the kind of puzzle produced by poor observability and reusability of data, combined with ineffective data governance policies.
What companies are beginning to understand is that the problem of data integration must be addressed as an organizational rather than a technical issue. Business Units are (or should be) responsible for data assets, taking ownership both in a technical and functional sense.
In the last decade, by contrast, Data Warehouse and Data Lake architectures, in all their variations, have freed data owners from the technical burden, while the knowledge of and competence over that data remained with the business units.
Unfortunately, a direct consequence was that central IT (or the data engineering team), once the first integration process was implemented, acquired ownership of the data, thus forcing a centralized ownership model. This is where integration breaks down: potential consumers who could create value from the data must now go through the data engineering team, which has no real business knowledge of the data it provides as ETL results. As a result, potential consumers cannot trust or actually exploit the data assets, and therefore cannot produce value down the chain.
The four principles of Data Mesh
This implies that the time has come not just for another architecture or technology, but for a completely new paradigm in the data world, one that solves all these integration (and organization) problems.
This is where Data Mesh comes in. It is, first of all, a new organisational and architectural model based on the principle of domain-driven design, a concept that has proved very successful in the field of microservices and is now being applied to data assets, so that they can be managed with the same domain-driven strategies.
In other words, it means getting data to work for our needs rather than spending our time fighting technical complexity. Data Mesh is both revolutionary, for the results it delivers, and evolutionary, as it exploits existing technologies and is not tied to any specific underlying stack. Let us now try to understand how it works, from a problem-solving point of view.
Data Mesh is based on four principles:
Decentralised domain-oriented data ownership and architecture
Data as a product
Self-service data infrastructure as a platform
Federated computational governance
As already mentioned, one of the most frequent causes of failure in data strategies is the centralised ownership model, due to its intrinsic bottleneck nature and its inability to scale.
Adopting Data Mesh, first of all, breaks this model up into a decentralized (domain-driven) one. Domains must be the owners of the data they know and provide to the company. This ownership must be both functional, serving the business, and technical/technological, so that domains can move at their own speed, with the technology they are most comfortable with, while still delivering valuable results.
Trust and budget
Data as a product is apparently a simple, almost trivial concept. Data is presented as if it were a product: easily identifiable, described in detail (what it is, what it contains and so on), with public quality metrics and guaranteed availability.
Products are more likely to be sold (reused, in the case of data) if you can build a trusted relationship with potential consumers, for example by allowing users to write reviews and FAQs that help the community share its experience with that asset. The success of a data asset is driven by its accessibility, so data products must provide different access options to meet consumers' needs (technical and functional): the more flexibility consumers find, the more likely they are to exploit the data.
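To make the "data as a product" idea concrete, here is a minimal sketch of what a self-describing data product could look like. The class and field names are hypothetical, invented for illustration; Data Mesh prescribes no specific API, only that each product publishes its description, quality metrics, guarantees, access options and consumer feedback.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical self-describing data product (illustrative names, not a standard)."""
    name: str
    owner_domain: str          # the business domain that owns the product
    description: str           # what it is, what it contains
    version: str               # previous versions stay available
    freshness_sla_hours: int   # guaranteed update frequency
    quality_score: float       # published quality metric, 0.0 to 1.0
    access_endpoints: dict = field(default_factory=dict)  # e.g. {"rest": "...", "stream": "..."}
    reviews: list = field(default_factory=list)

    def add_review(self, author: str, rating: int, comment: str) -> None:
        """Consumers share their experience, building trust for future users."""
        self.reviews.append({"author": author, "rating": rating, "comment": comment})

    def average_rating(self) -> float:
        """Aggregate consumer feedback, like a product rating on a shop page."""
        if not self.reviews:
            return 0.0
        return sum(r["rating"] for r in self.reviews) / len(self.reviews)
```

A consumer browsing the mesh can then read the description, check the quality score and SLA, pick the access endpoint that fits their needs, and see what experience other teams have had with the asset, all before committing any budget.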
Previous versions of the data must always be available, changes must be tracked, and access must be possible in different ways, for example as a stream, or flow of events (provided this is reasonable and compatible with the type of data), without being bound to the structure of the underlying storage.

But let us not focus too much on the technical side. The truly revolutionary aspect is that data, as a product, now has a pre-defined, well-determined price, with obvious effects on the ability to budget and report on a project. In a traditional system, even a modern one such as a data lake, all data operations (insertion, retrieval, preparation) go through the data engineers. If they have time and can satisfy the request immediately, that is excellent for the user, but not so much for the company, as it implies overcapacity. This is inefficient and very, very expensive. If instead they are staffed with exactly the capacity needed for daily operations and projects already underway, any new activity will be postponed accordingly.
In a scenario where IT is a fixed cost for the company, capacity cannot be expanded on demand. Even if it could, it would be almost impossible to determine how much of the additional effort is directly linked to the new project, how much of the work can or could be reused in the future, how much should have been done anyway, and so on. When several new projects start simultaneously, IT overload can easily become an indirect, and sometimes uncontrollable, cost. Data Mesh provides immediate visibility of what data is available, how (and how much) it is used, and what the associated cost is. Time and money are no longer unknowns (or worse, indeterminable in advance).
The turning point
To understand the meaning of the next two principles, "Self-service data infrastructure as a platform" and "Federated computational governance", let us draw a parallel with a type of platform that has become familiar to everyone.
This is perhaps an oversimplification, but it serves the purpose. Imagine working in logistics and having to book a hotel for a sales meeting. There are some requirements (number of rooms available, price within budget, an on-site conference room, easy parking) and a list of hotels to choose from.
The company Data Lake contains all the data you need: it lists every hotel in the city where the meeting will take place. But it works like the Yellow Pages of old. How long does it take to find a suitable hotel using such a directory? It is practically impossible to say: if the Abbey Hotel, the first on the list, meets all the requirements, not long; but if you have to get all the way to the Zephir Hotel, among the last on the list, things look very different.
Not only are the entries ordered in a predetermined way (alphabetically), but you need to check each one in sequence, phoning to ask about availability, price and so on. In addition, the amount of information available for each hotel in the Yellow Pages is wildly inconsistent.
It would be nice to rule some hotels out without having to call them, but some have bought large ad spaces where they state whether they have a conference room or a restaurant, while others have listed only a telephone number.
If we also want to check the quality of a location, to avoid sending the Sales Manager to a sleazy inn, complexity grows exponentially, as does the uncertainty about how long it will take to find out. Maybe we are lucky and find a colleague who has already been there and can give us feedback; otherwise we have to check in person. When budgeting for the company's sales meeting, how do you estimate what it will cost, in time and money, just to find the right hotel? As if that were not enough, the Sales Manager is furious because this problem recurs every year and it is never possible to give an estimate, because the time it took to find a hotel last time says nothing about how long it will take next time.
If we now replace the word "hotel" with "data" in the previous example, we realize that the parallel may appear a little extreme, but it is not so far-fetched. A Data Mesh is, in this sense, a booking platform (think of hotels.com or booking.com) where each data producer becomes a data owner and, just like a hotel owner, wants to be found by those who use the platform. In order to appear in response to a search, however, they must respect certain rules imposed by federated governance.
The hotel owner must list a price (cost) and availability (uptime), as well as a structured description of the hotel (the data set) including address, category and so on, plus all the required metadata (is there a parking space? a conference room?). Each of these features becomes easy to find (it is always present and shown in the same place), searchable, and can act as a filter. People staying at the hotel (using the data set) also leave reviews, which help other customers in their choice and help the hotel owner improve the quality of the offer, or at least the accuracy of the description.
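The booking-platform mechanics just described can be sketched in a few lines of Python. The catalog entries and field names below are invented for illustration; the point is that, because federated governance obliges every data product to publish the same mandatory metadata, a consumer can filter the whole catalog by published guarantees instead of inspecting each asset one by one, like filtering hotels by amenities.

```python
# Hypothetical catalog: every entry publishes the same governance-mandated metadata.
catalog = [
    {"name": "sales_orders", "domain": "sales",     "cost_per_query": 0.02,
     "uptime": 0.999, "has_stream": True,  "rating": 4.6},
    {"name": "web_clicks",   "domain": "marketing", "cost_per_query": 0.01,
     "uptime": 0.995, "has_stream": True,  "rating": 3.9},
    {"name": "hr_headcount", "domain": "hr",        "cost_per_query": 0.05,
     "uptime": 0.990, "has_stream": False, "rating": 4.2},
]

def search(catalog, min_uptime=0.0, max_cost=float("inf"), needs_stream=False):
    """Return only the data products that meet the consumer's requirements,
    the way a booking site shows only hotels matching the chosen filters."""
    return [p for p in catalog
            if p["uptime"] >= min_uptime
            and p["cost_per_query"] <= max_cost
            and (p["has_stream"] or not needs_stream)]

# A consumer who needs stream access and high availability filters the catalog directly,
# with cost and guarantees known up front, rather than queuing behind a central team.
results = search(catalog, min_uptime=0.995, needs_stream=True)
```

The cost and uptime fields are what turn budgeting from guesswork into arithmetic: the price of using a data product is known before the project starts.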
The self-service aspect is twofold. From the user's point of view, the Sales department can choose and book the hotel directly, without needing (and paying for) the help of Logistics (the Data Lake engineering team). From the owner's point of view (hotel owner or data owner), it means being free to choose and advertise which services to offer (air-conditioned rooms, whirlpools, butler service and so on) in order to satisfy, and even exceed, the wishes and demands of customers. In the data world this second aspect concerns the freedom of Data Producers to choose their own technological path, in accordance with the standards approved by federated governance.
Last but not least, the Data Mesh architecture offers easy scalability (once all the hotels/data sets are on board, the system can grow to accommodate those of other cities/include new ones) and reuse. Reuse means that the effort spent creating one solution can, at least in part, be used again to create another. Let us stick with the hotel analogy: if the platform was created last year and we now want to do the same for B&Bs, there is much that can be exploited and no need to start from scratch. Of course, the metadata will differ (Bed and Breakfasts have no conference rooms), but we can still use the same user feedback system and the same technology to collect information about prices and availability, which, once again, will be provided by the owners.
A project and organizational change
That said, Data Mesh may seem like a piece of cake. In reality, building a Data Mesh is a massive project, and it makes sense mainly for large companies: if we only have three or four hotels, it goes without saying, there is no point in building a booking platform.
What is important to keep in mind is that a Data Mesh architecture, to express its full potential, requires a profound organisational change in the company. To mention the most obvious aspect, data engineers must migrate from the centre (the data lake) to the data producers, to guide them in preparing data correctly, complying with federated governance rules and exposing it properly. It also requires a change of mindset, so that the entire company starts to consider data as a product, freeing itself from the limits and bottlenecks of a Data Lake and reaping the benefits of a truly distributed architecture and, therefore, of the new paradigm.
Luca Maestri, Apple's Chief Financial Officer, has said that people tend to attribute the success of large companies such as Apple, Amazon, Google or Facebook to their being creative labs where a large number of innovative ideas can emerge. In reality, these companies are successful because they are "execution machines": a great idea has no value if you cannot execute it effectively and quickly, respecting times and budgets.
Creating a Data Mesh is a huge undertaking, but it means building the solid foundations that will support the evolution of a data-driven business. We can have all the data in the world in our Data Lake, but if we cannot exploit it effectively and sustainably, we gain no benefit. Since in today's world standing still means falling behind, the only way to remain competitive is to create new products, services and solutions for your customers. To be an "execution machine", you need to spend time looking for opportunities, analyzing the market and chasing new customers, instead of looking for data in your Data Lake. Once the goal is reached, the reward can be a relaxing, rewarding weekend, gazing at the placid lake from your home and remembering the time when your Data Lake was just as motionless, and just as opaque.