What is Data Gravity?

When working with larger and larger datasets, moving the data to the various applications that need it becomes cumbersome and expensive. This effect is known as data gravity.

The term data gravity was coined by software engineer Dave McCrory to capture the idea that large masses of data exert a gravitational pull on IT systems. In physics, objects with greater mass pull objects with less mass toward them; this is why the moon orbits the earth and the earth orbits the sun.

Consider data as if it were a planet or another object with sufficient mass. As data accumulates (builds mass), there is a greater likelihood that additional services and applications will be attracted to it, just as gravity attracts objects around a planet. As the mass or density increases, so does the strength of the gravitational pull, and objects approaching the mass accelerate toward it ever faster. The same dynamics apply to data.

Note: Latency and throughput apply equally to both applications and services.
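For reference, the physical relationship the analogy borrows is Newton's law of universal gravitation: attraction grows with the masses involved and shrinks with the square of the distance between them.

```latex
F = G \, \frac{m_1 m_2}{r^2}
```

Here F is the attractive force, G the gravitational constant, m_1 and m_2 the two masses, and r the distance between them. In the analogy, a growing data mass plays the role of m_1, and nearby applications and services play the role of m_2.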

Why does data gravity cause problems?

Data doesn’t literally create a gravitational pull, but smaller applications and other bodies of data tend to gather around large data masses. As these data sets and the applications associated with them continue to grow, they become increasingly difficult to move. This is the data gravity problem.

Data gravity hinders an enterprise’s ability to be nimble or innovative once it becomes severe enough to lock the organization into a single cloud provider or an on-premises data center. To overcome the consequences of data gravity, organizations are turning to data services that connect to multiple clouds simultaneously.
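As a rough illustration of a data service reaching into more than one cloud, the sketch below opens the same dataset path on an AWS bucket and a Google Cloud bucket through a single interface. This is a minimal sketch, assuming the fsspec library with its s3fs and gcsfs backends installed, valid credentials for both clouds, and hypothetical bucket and object names.

```python
# Minimal sketch: read the same dataset from two clouds via one interface.
# Assumes fsspec + s3fs + gcsfs are installed and credentials are configured;
# the bucket and object names below are hypothetical.
import fsspec

sources = [
    "s3://example-aws-lake/events/2024-01-01.parquet",   # data mass in AWS
    "gs://example-gcp-lake/events/2024-01-01.parquet",   # copy in Google Cloud
]

for url in sources:
    with fsspec.open(url, "rb") as f:
        magic = f.read(4)  # the first bytes of a Parquet file should be b"PAR1"
        print(f"{url}: reachable, header={magic!r}")
```

The point of the design is that the application code stays the same regardless of which cloud currently holds the data, which is exactly the flexibility data gravity tends to take away.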

How does this all relate back to Database.com? If Salesforce.com can build a new data mass that is general purpose, yet close in locality to its other data masses and application/service properties, it will be able to grow its business and customer base that much more quickly. It also enables VMforce to store data outside the construct of ForceDB (Salesforce’s core database), enabling new adjacent services with persistence.

The analogy extends to weight: just as your weight differs from one planet to another, services and applications (compute) carry different weights depending on data gravity and on which data mass(es) they are associated with.

Data Gravity, Storage, and Cloud Computing

Duplicated data, outside of backups or DR strategies, is wasteful, so maintaining a single big data repository or data lake is the best method to avoid siloed and disparate datasets.

Rather than a data warehouse, which requires data to conform to a defined schema, a data lake with appropriate security controls can hold your raw data and content from multiple data sources.
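To make the raw, multi-source nature of a lake concrete, here is a minimal sketch of landing heterogeneous records in an object store without forcing a common schema up front. It assumes boto3 with configured AWS credentials and a hypothetical bucket named example-data-lake; the record shapes are made up for illustration.

```python
# Minimal sketch: land raw records from different sources in a data lake's
# raw zone as-is. Assumes boto3 with AWS credentials configured; the bucket
# name and record shapes are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

raw_records = [
    {"source": "crm", "payload": {"customer_id": 42, "event": "signup"}},
    {"source": "clickstream", "payload": {"page": "/pricing", "ms": 1834}},
]

for i, record in enumerate(raw_records):
    # No schema conformance is required at ingest time, unlike a warehouse load;
    # structure is imposed later, when the data is read.
    s3.put_object(
        Bucket="example-data-lake",
        Key=f"raw/{record['source']}/record-{i}.json",
        Body=json.dumps(record).encode("utf-8"),
    )
```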

A data lake with cost-effective scalability seems easy enough, and it can be, depending on an enterprise's data needs. Many organizations have a suitable on-premises data lake, but accessing that data lake from the cloud raises several challenges:

  • Latency – The farther you are from your cloud, the higher the latency you experience. For every doubling in round-trip time (RTT), per-flow throughput is roughly halved (see the sketch after this list). This slowdown hits hardest in data-intensive analytics workloads that leverage artificial intelligence and machine learning.
  • Connectivity – Ordering and managing dedicated network links, such as AWS Direct Connect or Google Cloud Dedicated Interconnect, can be costly. Balancing redundancy, performance, and operational costs is difficult.
  • Support – Operating and maintaining storage systems is generally expensive and complicated enough to require dedicated expert personnel.
  • Capacity – A location and infrastructure plan and budget for growth are required.
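The halving of per-flow throughput mentioned under Latency follows from the window-limited TCP relationship, throughput ≈ window size / RTT. A minimal sketch, assuming an illustrative 64 KiB window:

```python
# Window-limited TCP throughput: roughly window_size / RTT per flow.
# The 64 KiB window is an illustrative assumption.
WINDOW_BYTES = 64 * 1024

for rtt_ms in (5, 10, 20, 40, 80):
    throughput_mbps = (WINDOW_BYTES * 8) / (rtt_ms / 1000) / 1e6
    print(f"RTT {rtt_ms:>2} ms -> ~{throughput_mbps:5.1f} Mbit/s per flow")
```

Each doubling of RTT halves the achievable per-flow rate, which is why distance between an on-premises lake and the cloud shows up directly as slower analytics.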

On-premises data lakes can address latency by co-locating closer to public cloud regions and by purchasing direct network connections. Still, the cost is prohibitive for midsized companies that wish to leverage the innovative services of multiple clouds.
