Absortio

Email → Summary → Bookmark → Email

Excerpt

Today there are 40 days left until my 40th birthday. I’ve been working with data for 20+ years now and I feel like trying to summarize what I’ve learned in a few points.

Summary

Main Summary

The content is a personal and professional reflection by the author of Javisantana.com on the occasion of his upcoming 40th birthday, just 40 days away. With more than 20 years of experience working with data, the author sets out to distill the key lessons learned over his career. This kind of post belongs to a digital tradition in which experienced professionals share the perspective they have accumulated, offering a valuable view of how the technology and analytics field has evolved. The approach suggests an intention of legacy and shared learning, reflecting professional maturity and depth of experience. The proximity of a personal milestone such as a fortieth birthday adds an introspective dimension, enriching the content with a personal as well as a professional perspective. Reflections of this kind often serve as milestones for rethinking goals, taking stock of achievements and sharing knowledge with the community.

Key Points

  • More than 20 years of experience with data: The author highlights a long track record in the data field, which gives him a historical perspective on the evolution of the technology and analytics sector.
  • Reflection ahead of his 40th birthday: The proximity of this significant date prompts a retrospective and prospective look at his professional career and the lessons he has learned.
  • Intention to summarize key lessons: He sets out to synthesize his knowledge, signaling a willingness to share good practices and lessons learned with other professionals.
  • Focus on legacy and knowledge transfer: The post reflects a stage of professional maturity in which recording experiences and lessons for others is considered valuable.

Analysis and Implications

This content has significant symbolic and professional value, since it combines personal reflection with a technical assessment of decades of work in a constantly evolving field. The post may inspire other professionals to carry out similar exercises of self-assessment and knowledge synthesis. It can also serve as a valuable reference for young professionals interested in building a career in data analysis.

Additional Context

Reflecting on professional milestones in personal blogs is a common practice among technology experts; it helps build a career narrative that benefits both the community and the author’s own growth. The focus on data analysis as an area of specialization highlights the growing importance of this discipline in today’s digital environment.

Content

40 Things I Learned About Data — @javisantana

Today there are 40 days left until my 40th birthday. I’ve been working with data for 20+ years now and I feel like trying to summarize what I’ve learned in a few points.

I’ll share one thing every day until I turn 40.

1. It’s hard to capture reality with data

Trying to recreate an accurate version of reality, no matter what it is or how simple it looks, is hard.

Another way to see it: modeling reality always gets complex. There are always small nuances, special conditions, things that changed, edge cases and, of course, errors (which sometimes become features).

The only models I’ve found easy to work with and understand are the ones that model computer things.

2. There is no “best data format”

We format data to move it around. It could be across hundreds of kilometers or a few nanometers, but we always need to encode information somehow. I never found the “El Dorado” of data formats.

Text formats are easy for a human to read but harder and slower to parse.

Binary formats are fast to parse but hard to debug.

XML is a good container but it’s too verbose.

JSON is easy but lacks basic data types (dates, for example).

Serialization formats are not good for keeping data in memory, but formats designed for in-memory operations are not binary compatible across languages.

The most important thing I learned is: you need to find the right balance between speed, flexibility, compatibility and the human-computer interface.
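A minimal sketch of that trade-off using Python’s standard library (the record and field layout here are invented for illustration, not taken from the post):

    import json
    import struct

    record = {"user_id": 42, "score": 3.14, "active": True}

    # Text format: human-readable and self-describing, but larger and slower to parse.
    as_json = json.dumps(record).encode("utf-8")

    # Binary format: compact and fast to decode, but opaque when debugging and
    # tied to the exact field order and types chosen here (little-endian int64,
    # float64, bool).
    as_binary = struct.pack("<qd?", record["user_id"], record["score"], record["active"])

    print(len(as_json), as_json)      # ~46 bytes, readable as-is
    print(len(as_binary), as_binary)  # 17 bytes, needs the layout to make sense of it

    # Decoding the binary blob requires knowing the schema out of band.
    user_id, score, active = struct.unpack("<qd?", as_binary)

The same tension shows up at any scale: readable and flexible on one side, compact and fast on the other.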

3. Good data models make good products

When the data model is not well designed, everything that comes after it feels wrong. You feel like you are doing hacks and tweaks all the time.

When the data model is the right one, everything flows: it’s easy to explain, and when you make a change it just fits, like a good Tetris play. Only time can tell if the data model was the right one. If after some years you still use the same data model (maybe not the same database or the same code), you did it right. It’s not that different from cars, buildings, companies…

Designing a good data model takes time, prototypes and a good understanding of the reality you are modeling (see point 1 for more info).

4. The second most important rule of working with data: the fastest data is the one you don’t read

As simple as it sounds, most people forget to use one of the most important database features: indices. You also need to think about what data you actually need; a lot of apps are full of select * from table.

The problem is that as your system grows, so do the number and complexity of queries. Knowing what data you need becomes harder. To avoid that you need… yes, data about how you query your data.
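A small illustration of both ideas, using SQLite for convenience (the table, columns and numbers are invented for the example, not from the post):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (user_id, ts, payload) VALUES (?, ?, ?)",
        [(i % 100, f"2024-01-{i % 28 + 1:02d}", "x" * 200) for i in range(10_000)],
    )

    # Without an index, filtering by user_id scans every row; with one, the
    # database only touches the rows it needs.
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

    # Select only the columns you actually use instead of select *, so the
    # heavy payload column is not decoded or shipped back for this query.
    rows = conn.execute("SELECT ts FROM events WHERE user_id = ?", (7,)).fetchall()
    print(len(rows))  # 100 matching rows, found through the index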

5. When in doubt, use Postgres as your database.

It’s quite typical when you start a project to have to decide what DBMS to use: Elastic, Mongo, some key/value store like Redis, funny things like Neo4J. If you have a use case that clearly fits a specific database, fine; otherwise, use Postgres or anything relational. Of course, there will be someone who says “but it does not scale”. Anyone who has worked with a system at scale knows there is no storage system that scales well (except when it’s simple as hell and eventually consistent, and not even then).

I love Postgres for many reasons: it’s solid, battle tested, supports transactions (I’ll write about them), feature complete and fast; it’s not owned by a VC-backed company; it’s guided by the community with calm and steady progress; and there is great tooling, cloud services providing the infra, companies with expertise…

When you pick something funny, you end up developing half of the features a solid RDBMS provides, just worse.

I decided to use Redis as the storage for Tinybird and it’s working great, but as the project evolves you miss many of the built-in features Postgres provides. Probably a mistake.

6. Behind every null value there is a story

When you join a company, just ask about them; you’ll learn a lot.

7. When I try to understand data I always end up using a histogram

When visualizing data you have to pick the right visualization type but before that you need to understand the data.

I start with an avg, then avg plus stddev, then min-max, and finally I go with a histogram.

It captures min, max, avg and, most importantly, the data distribution.
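A bare-bones version of that kind of quick histogram (the function and formatting are my own sketch, assuming a plain list of numeric values):

    import random

    def text_histogram(values, bins=10, width=40):
        lo, hi = min(values), max(values)
        step = (hi - lo) / bins or 1
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / step), bins - 1)
            counts[idx] += 1
        peak = max(counts)
        for i, c in enumerate(counts):
            bar = "#" * int(width * c / peak)
            print(f"{lo + i * step:10.2f} | {bar} {c}")

    # Example: 1,000 samples from a normal distribution.
    text_histogram([random.gauss(50, 10) for _ in range(1_000)])

Even something this crude shows the range and the shape of the distribution at a glance, which an average alone hides.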

8. Analytics is a product, not a department

When you have people asking for metrics and other people extracting them from data, you’ll end up with as many definitions of the same metric as there are people in the company.

Reporting requires the same things a digital product needs: owners, maintenance, clear definitions, improvements and, you know, giving people what they want in a way that is useful for everybody in the company.

Many companies don’t treat analytics as a first-class citizen and end up spending more to get less quality.

9. It’s better to master just one database than to be bad at two

It’s tempting to start using another database when you run into a performance problem or the lack of a feature.

There are always ways to make it perform better or to solve the problem with a workaround.

You’d be surprised how well your database can perform when you understand its internals. It’s not that bad to do that thing in two steps instead of one.

If you go after the shiny new thing just because you find a small roadblock, you’ll never understand the actual limits of your database and you may never know when there is a real reason to change.

10. Try to use the simplest possible data structure.

A few years ago, one of the websites I was working on made it onto the front page of Google (yes, that small blue link). The traffic it gets is pretty high.

I had to develop a search feature. The first thing you’d think of is using the database you already have, or maybe a specialized one like Elastic.

But in this case, I needed to use the database as little as possible to be able to cope with the load.

So I decided to go the simplest way: build the index as an in-memory array holding all the words, and run a linear search over it. Yes, a simple for loop with the search logic.

Was it the best index structure? Not if you only think about performance, but it worked, it was simple, and it was easy to maintain and change.

There is always time to make it more advanced. With time you end up loving simple and flat arrays.

BTW, you can read the full story here
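A sketch of that kind of index, assuming a list of (id, text) pairs; the data and field names here are invented, and the original implementation surely differs:

    documents = [
        (1, "cheap flights to lisbon"),
        (2, "weather in madrid today"),
        (3, "flights from madrid to lisbon"),
    ]

    # The whole "index" is a flat in-memory list of lowercased titles.
    index = [(doc_id, text.lower()) for doc_id, text in documents]

    def search(query, limit=10):
        terms = query.lower().split()
        results = []
        for doc_id, text in index:  # linear scan: a plain for loop
            if all(term in text for term in terms):
                results.append(doc_id)
                if len(results) >= limit:
                    break
        return results

    print(search("lisbon flights"))  # -> [1, 3]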

11. Learn SQL

You may not like SQL-based databases, but the probability of dealing with a SQL-based system during your career is so high that learning it as soon as possible will compound.

I didn’t like SQL, and I still don’t like it even though I work with it every single day, but I have to recognize it’s a handy tool.

12. The third most important rule of working with data: the fastest data after the one you don’t read is the one you read (and process) only once

In other words, caching is one of the most important features, and you should trade processing time for memory (or any other kind of storage).

Caching is also applied statistics: you usually use LRU or MRU in combination with some kind of TTL, but there are many models to improve a cache.

Just gather info about how your data is accessed and run simulations on how well different cache models perform.
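A toy version of that kind of simulation: replay an access log against an LRU cache of a given size and measure the hit rate (the access pattern and sizes are made up for illustration):

    import random
    from collections import OrderedDict

    def lru_hit_rate(accesses, capacity):
        cache = OrderedDict()
        hits = 0
        for key in accesses:
            if key in cache:
                hits += 1
                cache.move_to_end(key)          # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict the least recently used
        return hits / len(accesses)

    # Skewed access pattern: a few hot keys plus a long tail of cold ones.
    log = [random.randrange(20) if random.random() < 0.8 else random.randrange(100_000)
           for _ in range(100_000)]

    for capacity in (10, 100, 1_000):
        print(capacity, round(lru_hit_rate(log, capacity), 3))

Swapping in a different eviction policy or a TTL and replaying the same log is a cheap way to pick a cache model with data instead of guesses.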

13. There is always a schema

You decide it when you write the data or later when you read it, but at some point you need to decide attributes and data types for your data.

When you store data without a schema you usually need “armies of engineers who effectively become the schema”. It looks easier because a lot of decisions are postponed.

On the other hand, not choosing a schema up front accelerates development quite a lot; that’s why databases like MongoDB became so popular a few years ago.
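The same decision, sketched with the standard library (the record and types are invented for the example): whether the schema lives in the writer or in every reader, someone has to pick attributes and types eventually.

    import json
    from dataclasses import dataclass
    from datetime import date

    # Schema on write: types are decided up front and bad data fails immediately.
    @dataclass
    class Signup:
        user_id: int
        signed_up: date

    def parse_signup(raw: dict) -> Signup:
        return Signup(
            user_id=int(raw["user_id"]),
            signed_up=date.fromisoformat(raw["signed_up"]),
        )

    # Schema on read: store whatever arrived and decide types at query time...
    raw_events = [json.loads('{"user_id": "42", "signed_up": "2023-07-01"}')]

    # ...which means every reader re-implements (and re-debugs) this parsing.
    signups = [parse_signup(e) for e in raw_events]
    print(signups[0])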

14. Almost everything was invented years ago

  • Document-based databases: IMS from IBM, 1966, used in the Apollo program
  • Analytics databases: Teradata, 1979

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution.

  • E. F. CODD, IBM Research Laboratory, San Jose, California - 1970

Those are just a couple of examples, but there are many more. So it’s worth spending some time researching old systems to better understand the new ones.