Building My Automated Media Tracking System

By Ajay

Disclaimer: I don't manually track every single thing I do. I look at the code for this site probably once every 3 months. I like to over-engineer things. And that's exactly what I did here.

The System

My personal media tracking system combines data from multiple sources: Goodreads, Audible, Spotify, and a physical bookshelf computer vision setup. Here's how each component works and how they all fit together to create my books page!

Actual Pseudocode Links (updated 2/3/25)

This is a simplified (ish) version of the code I previously linked from my GitHub. Check out the actual, more detailed pseudocode for each component below:

Goodreads Integration

The Goodreads component pulls my reading lists with a custom scraper, since Goodreads discontinued their public API. It handles both the "currently reading" and "read" shelves, with built-in caching to prevent excessive requests.

# Key Components
from dataclasses import dataclass
from typing import List

def goodreads_pipeline():
    # Initialize cache store
    cache = {}
    last_fetch = 0

    # Define data types
    @dataclass
    class Book:
        title: str
        author: str
        cover_url: str

    # Helper functions
    def format_author(name: str) -> str:
        # Convert "Last, First" to "First Last"
        pass

    def normalize_url(url: str) -> str:
        # Clean and standardize cover URLs
        pass

    # Main scraping
    def scrape_shelf(html: str) -> List[Book]:
        # Use BeautifulSoup/similar
        # Extract book details
        # Format data
        return books

    # API endpoint
    def handle_request():
        if cache_valid:
            return cached_data
        else:
            new_data = fetch_and_process()
            update_cache(new_data)
            return new_data
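
To make the scraping step less abstract, here's roughly what scrape_shelf can look like with requests and BeautifulSoup. The shelf URL pattern and CSS selectors are placeholders based on how Goodreads shelf pages are laid out, so expect to adjust them whenever Goodreads changes its markup.

import requests
from bs4 import BeautifulSoup

def scrape_shelf(user_id: str, shelf: str = "read") -> list[dict]:
    # Goodreads shelf pages are plain HTML, so a GET with a browser-like UA is enough
    url = f"https://www.goodreads.com/review/list/{user_id}?shelf={shelf}"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

    soup = BeautifulSoup(html, "html.parser")
    books = []
    for row in soup.select("tr.bookalike.review"):  # one row per book (assumed class names)
        title = row.select_one("td.field.title a")
        author = row.select_one("td.field.author a")
        cover = row.select_one("td.field.cover img")
        if title and author:
            books.append({
                "title": title.get_text(strip=True),
                "author": author.get_text(strip=True),  # still "Last, First" at this point
                "cover_url": cover["src"] if cover else None,
            })
    return books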

Audible Integration

Since Amazon shut down their Audible API, I built a custom scraper that works around their anti-bot measures. It works similarly to the Goodreads scraper. This was a fun project, especially because getting to the right Audible endpoint meant booting up a VM and running the Audible app to trace which endpoint would work without requiring AAA proxies.

def audible_pipeline():
    # Setup secure session
    session = create_secure_session()

    # Configure scraper
    def setup_scraper():
        headers = get_rotating_headers()
        cookies = load_session_cookies()
        return ScraperConfig(headers, cookies)

    # Main scraping logic
    def fetch_library():
        books = []
        page = 1
        while has_next_page:
            new_books = scrape_page(page)
            handle_rate_limits()
            books.extend(new_books)
            page += 1
        return books

    # Process results
    def process_books(books):
        metadata = extract_metadata(books)
        cache_results(metadata)
        return metadata
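
Most of the interesting work hides inside get_rotating_headers and handle_rate_limits. A stripped-down sketch of that request logic is below; the header pool, status-code handling, and backoff numbers are placeholders, not the exact values the real scraper uses.

import random
import time
import requests

USER_AGENTS = [
    # Small rotation pool; the real values would mirror whatever the VM session sent
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def get_rotating_headers() -> dict:
    return {"User-Agent": random.choice(USER_AGENTS), "Accept": "application/json"}

def fetch_page(session: requests.Session, url: str, max_retries: int = 5) -> dict:
    # Back off exponentially whenever the endpoint starts rate limiting
    for attempt in range(max_retries):
        resp = session.get(url, headers=get_rotating_headers(), timeout=15)
        if resp.status_code == 429:
            time.sleep(2 ** attempt + random.random())
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")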

Physical Bookshelf Scanner

The most complex component uses computer vision on a Raspberry Pi setup to track my physical books.

def vision_pipeline():
    # Initialize hardware
    camera = setup_camera()

    # Image processing
    def process_image(img):
        gray = convert_to_grayscale(img)
        edges = detect_edges(gray)
        spines = find_book_spines(edges)
        return extract_titles(spines)

    # Comparison logic
    def analyze_changes():
        current = capture_shelf()
        processed = process_image(current)
        diff = compare_with_previous(processed)
        save_current_state(processed)
        return diff
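
The helper names above map onto standard OpenCV calls. Here's a rough sketch of process_image using Canny edges, a Hough transform to find near-vertical spine lines, and Tesseract OCR on each strip between them; the thresholds are ballpark numbers rather than tuned values.

import cv2
import numpy as np
import pytesseract

def process_image(img: np.ndarray) -> list[str]:
    # Grayscale + edge detection (convert_to_grayscale / detect_edges)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # On a straight-on shelf photo, spines show up as long, mostly vertical lines
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=120,
                            minLineLength=img.shape[0] // 3, maxLineGap=20)
    xs = sorted(x1 for line in (lines if lines is not None else [])
                for x1, y1, x2, y2 in line if abs(x1 - x2) < 10)

    # OCR each vertical strip between neighbouring spine lines (extract_titles)
    titles = []
    for left, right in zip(xs, xs[1:]):
        if right - left < 20:  # skip slivers between double-detected lines
            continue
        strip = cv2.rotate(gray[:, left:right], cv2.ROTATE_90_CLOCKWISE)  # spine text runs vertically
        text = pytesseract.image_to_string(strip).strip()
        if text:
            titles.append(text)
    return titles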

Data Integration Layer

All sources feed into a central processor that uses LLaMA v3 for interpreting and standardizing the data.

def integration_pipeline():
    # Data collection
    def collect_data():
        return {
            'goodreads': fetch_goodreads(),
            'audible': fetch_audible(),
            'spotify': fetch_spotify(),
            'physical': fetch_vision_data()
        }

    # LLaMA processing
    def process_with_llm(data):
        cleaned = standardize_format(data)
        processed = llama_model.process(cleaned)
        return generate_final_format(processed)

    # Frontend update
    def update_frontend(data):
        json_data = convert_to_json(data)
        api_response = push_to_api(json_data)
        verify_update(api_response)
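
As a rough illustration of the LLM step, here's what llama_model.process could look like if the model is served locally through something like Ollama; the prompt, model tag, and output schema are simplified stand-ins rather than the exact ones the site uses.

import json
import requests

PROMPT_TEMPLATE = """Normalize these media records into a JSON list with the keys
title, author, source, and status. Return only JSON.

Records:
{records}
"""

def process_with_llm(data: dict) -> list[dict]:
    # Assumes a local Llama 3 served by Ollama; swap in whatever host/model you actually run
    prompt = PROMPT_TEMPLATE.format(records=json.dumps(data, indent=2))
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    # The model is instructed to reply with JSON only, so parse its text response directly
    return json.loads(resp.json()["response"])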

This system runs daily, aggregating data from all sources and maintaining an up-to-date view of my media consumption across platforms. The frontend displays this data in a clean, unified interface on my personal website.
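
Scheduling is the least interesting part: a single entrypoint script kicked off by cron (or a systemd timer) is enough. The paths, timing, and module layout below are illustrative; the three functions come from the integration pseudocode above.

# run_tracker.py - daily entrypoint, e.g. via a crontab line like:
#   0 6 * * * /usr/bin/python3 /home/ajay/media-tracker/run_tracker.py >> /var/log/media-tracker.log 2>&1
from integration import collect_data, process_with_llm, update_frontend  # hypothetical module layout

def main():
    data = collect_data()               # Goodreads + Audible + Spotify + shelf scanner
    processed = process_with_llm(data)  # normalize with the LLM step
    update_frontend(processed)          # push the final JSON to the site

if __name__ == "__main__":
    main()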

While there's always room for improvement (especially in the computer vision component), this setup has been reliably tracking my reading and listening habits with minimal manual intervention.