PixLore

Unveiling Visual Narratives

Welcome to the GitHub page for PixLore, our state-of-the-art image captioning model. We combine the strengths of BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models) with LoRA (Low-Rank Adaptation of Large Language Models) to create an effective and precise image captioning tool. Our dataset, rich in image descriptions, was labeled automatically using a combination of multi-modal models.
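To make the LoRA idea concrete, here is a minimal sketch of the low-rank update it relies on: the pretrained weight W stays frozen, and only two small matrices A and B are trained, so the layer effectively applies W + (alpha / r) * B A. This is not PixLore's training code. The function names, matrix sizes, and alpha value are assumptions chosen for illustration, and plain Python lists stand in for real tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)  # the LoRA rank is the number of rows in A
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical sizes, far smaller than any real BLIP-2 layer.
d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is a no-op at first

w_eff = lora_effective_weight(w, a, b, alpha)

full_params = d_out * d_in         # parameters in the frozen weight
lora_params = r * (d_in + d_out)   # parameters actually trained
print(f"trainable: {lora_params} of {full_params}")
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one, and training only has to learn the small A and B matrices. The savings grow with layer size, since the trainable count scales with r * (d_in + d_out) rather than d_in * d_out.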

Visual Narratives in Action

Example 1:

Description for Image 1

Google Bard:

Sorry, I can't help with images of people yet.

GPT-4 (Microsoft Bing):

This is a photo realistic image of a group of cyclists on a road. The cyclists are wearing helmets and are on racing bikes. The cyclists are in a line, with the first cyclist in the foreground and the others following behind. The road is lined with trees and there is a white line on the side of the road. The cyclists' faces are blurred to protect their privacy.

Blip-large:

There are three people riding bikes on a road with trees in the background.

Ours:

A group of cyclists are riding their bicycles down a road, creating a thrilling scene. The cyclists are wearing helmets and goggles, indicating that they are taking part in a triathlon. The road is lined with trees, providing a scenic backdrop for the race. The cyclists are riding in a line, with some of them going ahead and others following behind. The atmosphere is filled with excitement and energy as they navigate the road.

Example 2:

Description for Image 2

Google Bard:

The image shows a red stop sign and a blue U-turn sign on a pole at a rural intersection. The stop sign is in the foreground and the U-turn sign is behind it. The signs are clearly visible and easy to identify. The image conveys the message that drivers must stop at the intersection and that U-turns are allowed.

GPT-4 (Microsoft Bing):

This image shows a traffic sign that indicates a stop and a roundabout ahead. The sign is located on the side of a road with a pavement and some bushes. The sky is blue with some clouds and there are some buildings in the background. The image is realistic and has a lot of details.

Blip-large:

There is a stop sign on the corner of a road.

Ours:

A photo of a road with a stop sign on the side. The stop sign is positioned on the side of the road, indicating that it is a designated intersection. The sky above is cloudy, creating a gloomy atmosphere. The road is surrounded by a fence, and there is a curb separating the road from the adjacent area.

Example 3:

Description for Image 3

Google Bard:

The image shows two cats sleeping soundly on a pink blanket, curled up together with their tails intertwined. The cats are best friends, and the image conveys a sense of peace and tranquility.

GPT-4 (Microsoft Bing):

Two furry friends are enjoying a cozy nap on a comfortable couch. They have found the perfect spot to rest their heads: the TV remotes. One is a black and white striped tabby with cute white paws, and the other is an orange and black striped tabby with a fluffy tail. They look peaceful and relaxed as they snooze on the soft pink fabric. They are so adorable that anyone who sees them would want to join them for a cuddle.

Blip-large:

Two cats laying on a couch with remote controls on the back.

Ours:

Two adorable tabby cats are peacefully resting on a cozy couch. One cat is positioned towards the left side of the couch, while the other cat is positioned towards the right side. Both cats are curled up and appear to be enjoying a peaceful nap. On the couch, there is a remote control, possibly used for entertainment purposes. The cats' fur is a beautiful shade of gray, adding to their charm. It seems like they are enjoying a peaceful moment of relaxation on the couch.

Example 4:

Description for Image 4

Google Bard:

This abstract painting features colorful circles arranged in squares. The bright, saturated colors and dynamic composition create a sense of joy and optimism. The painting is reminiscent of the work of Wassily Kandinsky, a pioneer of abstract art.

GPT-4 (Microsoft Bing):

This is an image of a colorful abstract painting that shows the artist's creativity and expression. The painting consists of 16 squares, each with a different color and pattern. The squares are arranged in a 4x4 grid, creating a contrast and harmony among them. The colors used are orange, red, blue, green, purple, and yellow, which are bright and vivid. The patterns are concentric circles, ovals, and rectangles, which are simple and geometric. The background is a light blue color, which creates a sense of space and calmness. The painting is done in a loose, brushy style, which adds texture and movement to the image.

Blip-large:

A painting of a square with many different colors and shapes.

Ours:

A colorful painting depicting a variety of circles and squares is displayed on a wall. The painting is vibrant and eye-catching, with a variety of colors and patterns. The circles and squares create a geometric pattern, adding a touch of creativity to the scene. The painting is a work of art, showcasing the artist's skill and creativity. The wall is covered in a soft, smooth surface, creating a comfortable and inviting atmosphere. The painting is a perfect addition to any room, adding a pop of color and creativity to the space.

Example 5:

Description for Image 5

Google Bard:

The image shows a plate of sliced bananas, bread, and peanut butter on a wooden table. The food is arranged neatly on the plate, and the colors are bright and vibrant. The image conveys a sense of simplicity and nostalgia, and it is a reminder of the simple pleasures in life.

GPT-4 (Microsoft Bing):

The image shows a cozy and delicious breakfast scene on a wooden table. The table is covered with a checkered tablecloth and has a candle burning in the background. Two black mugs with witty messages are filled with hot coffee, and a plate of toast and bananas is ready to be enjoyed. A jar of smooth and creamy peanut butter is also on the table, inviting the user to spread it on the toast. The image conveys a sense of warmth, comfort, and happiness.

Blip-large:

There is a plate of food that includes bananas, toast and coffee.

Ours:

A beautifully set dining table is captured in this photo. On the table, there is a plate filled with a variety of delicious food. The plate is adorned with a variety of breads, including slices of toasted bread, sandwiches, and pastries. Next to the plate, there is a cup filled with a warm beverage, possibly coffee or tea. The cup is topped with a generous amount of whipped cream, creating a delightful combination. On the side of the plate, there is a candle, providing a cozy ambiance. The table is adorned with a tablecloth, adding a touch of elegance to the scene. It seems like a perfect morning or afternoon meal, ready to be enjoyed.

Human Evaluation Metrics

See how PixLore's narratives compare with those of GPT-4, Google's Bard, and BLIP in human evaluation.

We created a form asking randomly chosen people to pick which of the N captions was best for each image. Here are the results:

Pie Chart
The majority of respondents preferred PixLore over the other state-of-the-art models.
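For readers who want to run a similar comparison, the sketch below shows one way to turn form responses into the per-option shares a pie chart displays. It is an assumption about how such a tally could be done, not the script used for this evaluation. The function name and the vote list are hypothetical placeholders, and the labels A to D stand for anonymous caption slots rather than the actual survey data.

```python
from collections import Counter


def preference_shares(votes):
    """Return each option's share of the votes, most preferred first."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        return {}  # no responses, so there is nothing to report
    return {option: count / total for option, count in counts.most_common()}


# Placeholder responses: one entry per (respondent, image) choice.
votes = ["A", "B", "A", "C", "A", "D", "B", "A"]
shares = preference_shares(votes)

for option, share in shares.items():
    print(f"{option}: {share:.1%}")
```

Presenting the options under neutral labels such as A to D, and shuffling their order per image, is a common way to keep respondents from favoring a model by name.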

Discover PixLore

With BLIP-2 and LoRA fine-tuning at its heart, PixLore aims to redefine how we perceive images. Join us on this journey from sight to story.