# Session title

How to feed LLMs with data from the web

## Session URL

https://staging.webexpo.net/prague2024/sessions/how-to-feed-llms-with-data-from-the-web/

## Session Type

Conference Talk

## Talk Description

All major generative AI models have been trained using data scraped from the web. Applications of large language models (LLMs) often extract web data to provide up-to-date context using Retrieval-Augmented Generation (RAG). Unfortunately, reliably collecting online data at scale is challenging due to issues like blocking, dynamic content rendering, and the sheer volume of data. In this talk, Jan will explain how to build an efficient web data extraction pipeline and clean the HTML to avoid the “garbage in, garbage out” problem, and will demonstrate how to use the resulting data in an LLM application.


For questions and further discussion, find the speaker in the Speaker's Corner right after the talk.

## Tags

AI & ML & Bots, Automation, Data, Research

## Session Focused on

Web Development: 100%, Design & UX: 0%, Marketing: 20%, Business: 60%

## Session Presenter

Jan Čurn

## Session Presenter Bio

Jan is the founder and CEO of Apify. He has a lifelong passion for software engineering, which earned him an MSc and PhD in computer science and eventually led him to found Apify, a full-stack web scraping platform for developers. Jan is active in the Prague tech community, talks about software, startups, and AI, and regularly hosts events at Apify's rooftop office in Lucerna.

## Session Details

### Date

May 30, 2024 5:00 pm

### Duration

30 Minutes

### Specific Location

Lucerna Cinema

### General Location

Lucerna Palace, Prague, Czech Republic