October 23, 2017

TIL About BFG Repo-Cleaner

If you ever migrate code from Bitbucket to GitHub, you will be unpleasantly surprised to discover that GitHub by default does not allow files larger than 100MB (unless you pay extra for Large File Storage). At that point, you will probably realize that GitHub isn’t really the right place to store such large files, and that you are better off moving the data to S3 or somewhere else. However, you soon find that even if you remove the large file with git rm, you still cannot push the repo to GitHub, because the file exists not only in the current commit but throughout the history. Read more

September 26, 2017

Video: Uses of Machine Learning Applied to E-commerce

Here is the video of my talk “Usos del Machine Learning aplicados al E-commerce” (Uses of Machine Learning Applied to E-commerce), which I gave at ENAE Business School as part of the “Ecommerce & Big/Small Data” forum. In this talk I explain several algorithms used in e-commerce today, as well as the libraries available to implement them. The video quality on YouTube is not very good; if you want the video in HD, the only option is the University of Murcia’s player.

June 15, 2017

Handle missing categoricals with PMML

PMML, a markup language developed by the Data Mining Group, is, in my opinion, a much-needed standard in the Data Science ecosystem. PMML is basically an XML format for defining Machine Learning pipelines, which allows for (sort of) interoperability between different ML platforms. In particular, I have lately been working with Openscoring, a wonderful piece of software that runs a web server with an easy-to-use REST API for deploying models and evaluating data with them. Read more

May 8, 2017

Video: Data Science Sessions in Murcia

On April 21, 2017, thanks to the support of Centic and Info de Murcia, some 80 people came to let me talk their ears off for 3 hours about everything related to Data Science. Here is the video. The slides are available on SlideShare.

April 4, 2017

This is what a memory leak looks like

On the left side of this chart: VSZ (virtual memory) and RSS (RAM) over time (obtained via ps) for a process using a poor implementation of KafkaClient in Java, which was creating a new Kafka client per GET request. This is bad. On the right side of the chart: the current behavior after I fixed the previous developer’s code and implemented a singleton.
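
To make the difference concrete, here is a minimal sketch of the two patterns. It is not the actual code from that service: ExpensiveClient is a hypothetical stand-in for the Kafka client (it only counts how many instances get created), and the handler and holder class names are made up for illustration. The fix is shown with the initialization-on-demand holder idiom, which is one common way to get a lazy, thread-safe singleton in Java.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive client such as a Kafka client.
// It is NOT the real Kafka API; it only counts how many instances exist.
final class ExpensiveClient {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    ExpensiveClient() {
        INSTANCES.incrementAndGet();
    }

    String send(String message) {
        return "sent:" + message;
    }
}

// The problematic pattern: every GET request builds its own client.
final class PerRequestHandler {
    String handleGet(String payload) {
        ExpensiveClient client = new ExpensiveClient(); // one new client per request
        return client.send(payload);
    }
}

// Initialization-on-demand holder: the JVM creates the instance once,
// lazily and thread-safely, the first time get() is called.
final class ClientHolder {
    private ClientHolder() {}

    private static final class Lazy {
        static final ExpensiveClient INSTANCE = new ExpensiveClient();
    }

    static ExpensiveClient get() {
        return Lazy.INSTANCE;
    }
}

// The fixed pattern: every GET request reuses the shared client.
final class SingletonHandler {
    String handleGet(String payload) {
        return ClientHolder.get().send(payload);
    }
}

public class SingletonDemo {
    public static void main(String[] args) {
        PerRequestHandler leaky = new PerRequestHandler();
        for (int i = 0; i < 1000; i++) {
            leaky.handleGet("ping");
        }
        System.out.println("clients after per-request handler: " + ExpensiveClient.INSTANCES.get());

        int before = ExpensiveClient.INSTANCES.get();
        SingletonHandler fixed = new SingletonHandler();
        for (int i = 0; i < 1000; i++) {
            fixed.handleGet("ping");
        }
        System.out.println("extra clients created by singleton handler: "
                + (ExpensiveClient.INSTANCES.get() - before));
    }
}
```

With a real client that holds buffers, sockets and background threads, the per-request version keeps piling those up unless each client is explicitly closed, which would be consistent with the steadily climbing memory on the left of the chart.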
