DataChat

RAG application for discovering ML datasets

Overview

Building applications locally feels rewarding, but it is only a small part of real development work. The goal of this project was not only to experiment with a new architecture, Retrieval-Augmented Generation (RAG), but also to ship an end-to-end application, from data collection through to deployment.

DataChat is a full-stack web application for discovering new machine learning datasets. Users can query the LLM with a data domain, a problem they want to solve, or a model they plan to use. The system returns a structured list of candidate datasets with their source URLs. The application is free, requires no login, and all conversation history is saved in local storage.

The idea came from my own struggles deciding which ML project to work on next: I often found it difficult to pick the right dataset. It seemed like a fun problem to build a solution for, but the main target was to learn a new architecture and improve my frontend and deployment skills.


Tech Stack

The tech stack for this application was as follows:

  • Frontend: SvelteKit + TypeScript + Tailwind CSS
  • Backend: Python + FastAPI + PydanticAI
  • Database: PostgreSQL + pgvector
  • Infrastructure: Docker + Nginx + Cloudflare Tunnel + Hetzner VPS
  • LLM: GPT-4.1-nano via OpenRouter
  • Embeddings: OpenAI text-embedding-3-small

A few notes on decisions:

  • Svelte: My first JavaScript framework. Svelte felt like a natural starting point because its syntax stays close to plain HTML and JavaScript.
  • PydanticAI: I wanted to experiment with an agent framework, and Pydantic is already deeply integrated with FastAPI.
  • GPT-4.1-nano: Cheap LLM as most of the value comes from retrieval.
  • Hetzner: Cheapest VPS I could find.

Data Collection Pipeline

Before building anything exciting, I had to collect the datasets. Kaggle and HuggingFace provide nice APIs I could use to automate the collection.

I wrote a couple of scripts to call the APIs and gathered the following fields for each dataset:

  • Name
  • Owner(s)
  • Tags
  • Description
  • URL

The raw data was dumped into PostgreSQL, which essentially acted as a data lake.
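The collection scripts aren't shown here, but the normalisation step before that dump might look something like this (the raw key names are assumptions on my part; the real Kaggle and HuggingFace payloads use different schemas):

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    owners: list[str]
    tags: list[str]
    description: str
    url: str


def normalise(raw: dict) -> DatasetRecord:
    """Map a raw API payload onto the five stored fields.

    The key names below are illustrative -- the real Kaggle and
    HuggingFace responses are shaped differently.
    """
    return DatasetRecord(
        name=raw.get("title", ""),
        owners=raw.get("owners", []),
        tags=raw.get("tags", []),
        description=raw.get("description", ""),
        url=raw.get("url", ""),
    )
```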

The majority of the datasets initially had terrible formatting. The descriptions were unstructured and full of irrelevant information, e.g. YAML front matter, citations, and download guides.

This severely diluted the information captured in the embeddings.

I built a two-stage processing pipeline to manage this. The first stage used some absurd-looking LLM-generated regexes to clean up the markdown and cut out useless information. The second stage used a small LLM to augment the descriptions and tags of each dataset.
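The exact regexes aren't worth reproducing, but the first stage amounted to something along these lines (the patterns here are simplified illustrations, not the ones from the project):

```python
import re

# YAML front matter fenced by --- at the very start of the document
FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
# BibTeX-style citation blocks such as @misc{... }
CITATION_BLOCK = re.compile(r"@\w+\{.*?\n\}", re.DOTALL)
# Runs of three or more newlines left behind after stripping
EXTRA_BLANK = re.compile(r"\n{3,}")


def clean_description(text: str) -> str:
    """Stage one: strip boilerplate that would pollute the embeddings."""
    text = FRONT_MATTER.sub("", text)
    text = CITATION_BLOCK.sub("", text)
    return EXTRA_BLANK.sub("\n\n", text).strip()
```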

The cleaned data was inserted into a new PostgreSQL table with pgvector enabled. From there, OpenAI's small embedding model was used to generate an embedding for each description.
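A sketch of that embed-and-store step is below. The table and column names are my own assumptions (the real schema isn't shown in this post), and the OpenAI and psycopg calls are illustrative of the approach rather than verbatim project code:

```python
def to_vector_literal(vec: list[float]) -> str:
    """Format a Python list as a pgvector literal, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(x) for x in vec) + "]"


def embed_and_store(conn, dataset_id: int, description: str) -> None:
    """Embed one description and store it in an `embedding vector(1536)`
    column (text-embedding-3-small produces 1536-dimensional vectors).

    `conn` is assumed to be a psycopg connection to the pgvector-enabled
    database; the OpenAI client is assumed installed and configured.
    """
    from openai import OpenAI  # network call; imported lazily in this sketch

    client = OpenAI()
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=description,
    ).data[0].embedding
    conn.execute(
        "UPDATE datasets SET embedding = %s::vector WHERE id = %s",
        (to_vector_literal(emb), dataset_id),
    )
```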


System Architecture

The main architecture flow for the system is displayed below:

[Diagram: DataChat system architecture flow]

The main idea behind RAG is to retrieve relevant datasets from the vector database, augment the LLM's context with that information, and generate a response that combines the retrieved datasets with the LLM's own capabilities.

A PydanticAI agent generates a query embedding and performs a similarity search against the datasets using cosine distance (with a 0.5 threshold). This retrieves the three most relevant datasets and inserts them into the LLM's context.
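In production this runs in SQL via pgvector's `<=>` cosine-distance operator, but the logic can be illustrated in plain Python. I'm reading the 0.5 threshold as a maximum cosine distance, and all names here are mine:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, matching pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(query_emb, datasets, k=3, max_distance=0.5):
    """Top-k datasets by cosine distance, mirroring a query like
    SELECT name FROM datasets ORDER BY embedding <=> %s LIMIT 3,
    with rows beyond the distance threshold dropped."""
    scored = sorted(
        (cosine_distance(query_emb, emb), name) for name, emb in datasets
    )
    return [name for dist, name in scored[:k] if dist <= max_distance]
```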

The system instructions specify the response format, and the response is streamed to the frontend. After streaming completes, the frontend stores the conversation history and sends it back with subsequent queries, enabling multi-turn conversations.
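The post doesn't show the response format, but with PydanticAI a structured output is typically declared as a Pydantic model. A minimal sketch, with field names that are my guesses rather than the project's actual schema:

```python
from pydantic import BaseModel


class DatasetSuggestion(BaseModel):
    name: str
    url: str      # source URL returned to the user
    reason: str   # short justification from the agent


class ChatResponse(BaseModel):
    answer: str
    suggestions: list[DatasetSuggestion]
```

A model like this would be handed to the PydanticAI agent as its output type, so the LLM's reply is validated against the schema before it reaches the frontend.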

The conversation history of each chat is saved to the user’s local storage.


Deployment

As this was my first attempt at deploying an end-to-end application, the learning curve was initially steep, but it gave me a much better understanding of how to build complete systems.

I used Docker to containerise each component:

  • PostgreSQL
  • FastAPI
  • SvelteKit
  • Nginx
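A compose file wiring these together might look roughly like this (image names, build paths, and volumes are my assumptions, not the project's actual config):

```yaml
services:
  db:
    image: pgvector/pgvector:pg16     # Postgres with the pgvector extension
    volumes: ["pgdata:/var/lib/postgresql/data"]
  api:
    build: ./backend                  # FastAPI + PydanticAI
    depends_on: [db]
  web:
    build: ./frontend                 # SvelteKit
  nginx:
    image: nginx:alpine
    depends_on: [api, web]
    ports: ["80:80"]
volumes:
  pgdata:
```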

Nginx handles the secure connections and routes traffic where it needs to go. API requests get sent to the backend, everything else goes to the frontend. It’s the single point where all traffic enters the system.
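That routing boils down to a couple of `location` blocks; something like the following, where the upstream names and ports are assumptions:

```nginx
server {
    listen 80;

    # API requests go to the FastAPI container...
    location /api/ {
        proxy_pass http://api:8000;
    }

    # ...everything else goes to the SvelteKit frontend.
    location / {
        proxy_pass http://web:3000;
    }
}
```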

I also tried out Cloudflare Tunnel. This exposes the site publicly without opening any inbound ports: the server makes an outbound connection to Cloudflare's network, which shields it from direct attacks.

The entire application is hosted on a Hetzner VPS. This, alongside the tiny LLM, means everything only costs a couple of pounds a month (assuming I don't get hit with millions of requests :D).


Lessons For Future Projects

For a problem like this, the chatbot was a mistake.

A chatbot feels intuitive for conversational tasks, but not for dataset discovery. Prompting one is inherently vague, and the stochastic nature of LLMs means the same query returns different results, which was quite frustrating.

A better UX might be to keep embeddings for search but present results through filtered navigation: generate hundreds of granular tags with the LLM during preprocessing, then let users filter down interactively, seeing how many datasets match each combination. The interface would display what's available rather than forcing users to guess the right prompt.
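That filtering idea reduces to a simple faceted count. A minimal sketch (an illustration, not code from the project):

```python
from collections import Counter
from itertools import chain


def facet_counts(datasets: list[set[str]], selected: set[str]) -> Counter:
    """Among datasets matching every selected tag, count how many carry
    each further tag -- i.e. what each extra filter would leave behind."""
    matching = [tags for tags in datasets if selected <= tags]
    return Counter(chain.from_iterable(tags - selected for tags in matching))
```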

The LLM could become a last resort if results still aren’t narrow enough. This is an idea I would like to implement with a different project.