Flaneur Creation Notes

Circular Time Flaneur.001.jpeg

TL;DR#

Our team took part in the Hack Engine event hosted by Jike from April 8 to April 10. This article describes how our project Flaneur came to be: a website that uses artificial intelligence to generate music, sounds, and content, aiming to make users feel like they are walking with an old friend. The author also discusses other projects showcased at Hack Engine, including travel planning and general knowledge management.

What is Hack Engine?#

Here is the official introduction to Hack Engine:

Q: What is the difference between Hack Engine and a Hackathon?
A: There is no difference. Besides being a hackathon, we are also an incubator, a fund, and an alumni network of entrepreneurs.
Q: Isn't that Y Combinator?
A: Yes.

In simple terms, Hack Engine organized an AI-themed hackathon in which teams of up to 5 people had 48 hours to build a small product and demonstrate it.

Our team is also very focused on the development and application of generative AI, and since I am a long-time Jike user, we quickly decided to participate. We were curious about how everyone would use AI and wanted to meet the real people behind familiar IDs.

Our team has 5 members: backend engineer @ Xiao, frontend engineer @ Edison, designer @ Brant, general helper @ Jason, and algorithm engineer @ York, whom we brought in from the skiing group (?) as outside help. We had almost no hackathon experience, so we consulted @ Junyu before we started. Wandoujia (Pea Pod) is probably the earliest company in China to hold hackathons, usually building a small application within 24 hours. When I asked about his most memorable experience, @ Junyu looked up at the sky (calling cloud computing?) and recalled a year when the hackathon coincided with a heavy rainstorm in Beijing: everyone stayed up all night and even considered going out to rescue people. @ Junyu gave me three pieces of advice: the most important thing is to finish development; second, it should be interesting; and lastly... that's it.

Since the topic would only be announced on the kickoff day, we didn't do much other preparation. On the engineering side, we deployed a server capable of running Stable Diffusion and prepared two different OpenAI API accounts in case one got blocked. Additionally, based on @ Junyu's advice, we established several principles for selecting topics:

Fun and interesting, small / vertical enough
Can be completed in two days
Or, if it can't be completed in two days, it should be impressive enough (can we just show a video as the demo? fake it till you make it)

Copilot for?#

At 9:30 AM on Saturday, the topic was announced: Copilot for X.

Copilot for X, where X = anything, so it didn't say anything at all! To avoid narrowing our thoughts too early, we decided to think individually first and then meet to discuss. I originally wanted to take this opportunity to socialize, but I was surprised to see other teams discussing passionately and even starting to work, wow. By the time we were supposed to meet for discussion, I was already hungry, so we decided to fill our stomachs first.

The event was held in Wujiaochang, which we were not familiar with, and we didn't know where to eat, so we decided to stroll around while discussing, and inspiration struck!

Shanghai is a city well suited to walking. I remembered walking in Shanghai many years ago, probably on a summer evening: strolling alone along Hengshan Road, enjoying the breeze and listening to music, feeling very comfortable. The track I was listening to was Shu Qi's Tram, from the Hong Kong album in the LV SoundWalk series (yes, Hong Kong). LV selected some iconic locations in Hong Kong, invited local musicians to compose the music, and had Shu Qi narrate the introductions with stories woven in, which was very pleasant to listen to. However, the series has only three albums in China (Beijing, Shanghai, and Hong Kong), and I got through them quickly. Beijing is narrated by Gong Li and Shanghai by Chen Chong. I highly recommend visiting the locations in the order of the album while you listen.

Tram - Shu Qi

Among the entire series, my favorite is still this Tram; Shu Qi's voice is simply too beautiful! So I thought, if we could use AI to generate similar content, and also add AI-generated background music, it should be quite nice.

We discussed it, and it was indeed feasible, and the idea extended further: for example, we could add more timely information, such as the current weather, user's movement status, and step frequency, so that even if you open it at the same location, the content you hear each time would be different; we could also gather more information to introduce nearby landmark buildings.

We sorted out the requirements, and this small product would have the following characteristics:

Requires no operation at all: just open it and use it
Generates background music that matches the walking pace, based on current location, weather, movement status, etc.
Introduces the history and stories of nearby neighborhoods in a pleasant female voice, as if a real person were accompanying you on your walk
Pre-generates content for the current neighborhood and nearby neighborhoods, so you can keep walking without the content being interrupted

The final effect is a simplified version of LV SoundWalk. Or from another perspective, LV SoundWalk is too elite, with only a limited number of locations. In fact, every inch of land we live on has its own story, and every place deserves its own SoundWalk, so it can also be understood as the democratization of SoundWalk.

We completed the brainstorming and division of labor by 2 PM on Saturday, and got to work!

The Birth of Flaneur#

@ Junyu: The first step in making a product is to buy a domain name

First, we needed to give this product a name.

Shanghai is a very fashionable city. On the day I arrived in Shanghai, I took the subway from the airport to the city center, and as I walked out of the subway station, I saw a very elegantly dressed young woman holding a bouquet of flowers wrapped in an English newspaper. Shanghai is so fashionable! I couldn't help but marvel. Upon closer inspection, I realized I was wrong; it wasn't an English newspaper, but a French one. Shanghai is truly stylish! I couldn't help but marvel.

Since the inspiration came from walking in Shanghai, and we started creating this small product in Shanghai, it definitely needed a fashionable name!

So we named this product Flaneur. In French, a flâneur is a stroller, someone given to "aimless wandering." Given that Flaneur has no interactive features, the name fits very well.

The implementation of Flaneur can be summarized in the following steps:

1. Obtain user status information, such as geographical location and movement status
2. Gather information related to that geographical location, such as current weather, Wikipedia introductions, and POIs
3. Use GPT to generate a description that covers the information in #2
4. Convert the generated description in #3 into a human voice (Shu Qi) using TTS
5. Generate suitable BGM based on the location, weather, movement status, etc. in #1 & #2; if the user is walking, it should be soothing, and if the user is running, it can be more upbeat
6. Merge the audio tracks of #4 & #5 for playback
7. For demonstration purposes, we still need an interface to scroll through the description in #3
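To make the flow of the steps above concrete, here is a minimal sketch of how they could be wired together. It is not Flaneur's actual code: the names `Context`, `Segment`, `build_segment`, `fake_writer`, and `fake_tts` are hypothetical, and the GPT and TTS calls are replaced by stubs so the sketch runs without any API keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Context:
    """Everything gathered in steps 1 and 2 (field names are illustrative)."""
    latitude: float
    longitude: float
    weather: str
    wiki_summary: str
    pois: list[str]
    moving_fast: bool = False  # hypothetical stand-in for "movement status"


@dataclass
class Segment:
    """One playable unit: the text shown on screen plus the audio to mix."""
    script: str
    voice_audio: bytes
    bgm_mood: str


def pick_bgm_mood(ctx: Context) -> str:
    # Step 5: soothing when walking, more upbeat when running.
    return "upbeat" if ctx.moving_fast else "soothing"


def build_segment(
    ctx: Context,
    write_script: Callable[[Context], str],  # step 3, e.g. a GPT call
    synthesize: Callable[[str], bytes],      # step 4, e.g. a TTS call
) -> Segment:
    script = write_script(ctx)
    return Segment(
        script=script,
        voice_audio=synthesize(script),
        bgm_mood=pick_bgm_mood(ctx),
    )


# Stub services so the sketch runs offline; real ones would call external APIs.
def fake_writer(ctx: Context) -> str:
    return f"It is {ctx.weather} today. Nearby: {', '.join(ctx.pois)}. {ctx.wiki_summary}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for real audio bytes


# Placeholder inputs; the coordinates are only approximate.
ctx = Context(31.30, 121.51, "sunny", "Placeholder Wikipedia summary.",
              ["a metro station", "a shopping mall"])
segment = build_segment(ctx, fake_writer, fake_tts)
print(segment.bgm_mood)
print(segment.script)
```

Passing the writer and the synthesizer in as functions keeps the orchestration independent of any particular GPT or TTS provider, which is handy when one of them has to be swapped out mid-hackathon.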

There weren't many engineering difficulties; the only problem we encountered was that the webpage couldn't access the user's movement status, so we decided to abandon that information. The interesting part was the AI-related implementation, which also inspired us a lot. The AI part can be divided into three sections: text generation, text-to-speech (TTS), and music generation.

The narration in the LV SoundWalk series is full of character, weaving in local history and features, so it was crucial to get GPT to generate text in a similar style. My main job was tuning the prompts for GPT. Taking Wujiaochang as an example, I put some Wikipedia material into the prompt and asked GPT to act as a "tour guide" introducing Wujiaochang, but the generated text was very "touristy." I tried adding some more "on-site" descriptions, like "you just passed by an ancient door," which helped a bit, but GPT still couldn't resist saying things like "Welcome to Wujiaochang" or "Hello, old friend." What I wanted was a gentle female voice "breaking" straight into your ear and starting the conversation without pleasantries (otherwise I would feel embarrassed; after all, it's a very pleasant-sounding female voice).

Later, I suddenly thought of having GPT play the role of someone "introducing the nearby neighborhood to a blind friend," and this setting worked very well! However, GPT kept adding a comforting line at the end, like "Even though you can't see, ..." Following the same line of thought, I adjusted the prompt, and the final prompt and effect were as follows:

TTS was the most challenging part of the entire process! TTS, or text-to-speech, actually has many mature solutions, such as the frequently heard "Family, who understands?" and "Pay attention, this man is called Xiaomei" on Douyin, but Flaneur obviously couldn't use such unrefined voices; it should at least be Gao Yuanyuan! So we researched customized TTS and found two options:

MockingBird: An open-source model that can generate readings of any text with just a few seconds of source audio, but requires self-deployment, and the demo effect is acceptable.
11Labs: Upload a 10-minute audio file and it can read any text in that voice, and the effect is stunning! It's paid (but looks inexpensive). The downside is that it only supports English.

There are also some domestic vendors' voice customization solutions, but they require 15+ working days and costs of tens of thousands... It seems they are using very outdated technology, so the costs are high.

@ York spent a lot of effort deploying the MockingBird model and optimizing it, but the final effect was still mediocre. We studied the technology behind the model, and MockingBird is based on the previous generation of GAN, which might be the reason for its average performance.

While @ York was struggling with the model, I started playing with 11Labs. I first tried Shu Qi's voice + Chinese text, and the result sounded like a foreigner who had just taken the HSK. Shu Qi's voice + English was a bit lacking in feeling, not as stunning. So I thought, what if we used a familiar foreign actress? My first thought was Scarlett Johansson and the movie HER.

The result was stunning! Although I couldn't "have" Shu Qi, I unexpectedly got Samantha; what more could I ask for!

Family, who understands!

Music generation was the most ordinary part. Many years ago there was already software that generated BGM with a BPM matching your step frequency, so there wasn't much room for imagination. Time was also limited, and I felt the background music shouldn't take too much effort, so we decided to use AI to pre-generate a batch of different BGM tracks that could be played during the demo.
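The idea of matching BPM to step frequency can be illustrated with a small sketch that picks, from a set of pre-generated tracks, the one whose tempo is closest to the walker's cadence. This is only an illustration under assumed inputs: the `Track` type, the file names, and the BPM values are all made up, and the demo itself did not have access to movement data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str  # hypothetical file name of a pre-generated BGM track
    bpm: int   # tempo of the track in beats per minute


def pick_track(tracks: list[Track], steps_per_minute: float) -> Track:
    """Return the track whose tempo is closest to the walking cadence."""
    if not tracks:
        raise ValueError("need at least one pre-generated track")
    return min(tracks, key=lambda t: abs(t.bpm - steps_per_minute))


library = [
    Track("calm.mp3", 80),
    Track("stroll.mp3", 110),
    Track("jog.mp3", 160),
]

print(pick_track(library, 105).name)  # stroll.mp3
print(pick_track(library, 150).name)  # jog.mp3
```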

Since the interface was very simple, we completed the interface development and API integration on the first day; the most time-consuming parts were probably the backend handled by @ Xiao and the TTS part handled by @ York. By the evening of the second day, we had all the interfaces working; on both days we left the venue at 12 o'clock on the dot, while many teams were still working hard.

Time to go

Are you there? Check out the effect#

Please see our demo video below:

Pretend there is a video

A brief explanation:

You can use it just by opening the webpage, no operation required
All content is AI-generated, including content, music, and Samantha's sexy voice (although the music is pre-generated, it is also AI-generated)
We even embedded a segment of advertisement as an Easter egg (who knows, it might really be commercialized)

Experience address: https://flaneur.polytimeapp.com/
You can also open it via the "Read the original" link at the end of the article; please open it on a mobile device.

After opening, you need to click some bubbles to start playback; the loading is still a bit slow, and the generated information is somewhat monotonous, so please give Flaneur some patience.

When we first conceived it, we were completely inspired by LV SoundWalk. But when I actually used it and heard Samantha's voice introducing, I really wanted to have a conversation with her!

I love walking; sometimes it's for thinking, and sometimes it's with friends. The most comfortable state is actually going to an unfamiliar neighborhood and walking with "a very familiar friend." I often come up with odd associations and corny jokes; unfamiliar surroundings give me more inspiration and cues, and having someone respond while we walk and talk is the most comfortable state of all.

Furthermore, if we added access to the phone's camera, used the CLIP model to understand the images, and fed that into the prompt for GPT to generate content, it would really allow Samantha (no, Flaneur) to see what you see. She would truly be like an old friend accompanying you on a walk, listening to your ramblings, walking with you down one street after another. Just like in the movie HER. HER is already a movie from ten years ago, and some of its filming locations just happen to be in Shanghai.

It feels like a dream coming true, wow.

Demo Day!#

Demo Day is when a hundred teams showcase their results in one day, very cool! Each team only has 5 minutes, and exceeding the time will result in a ruthless interruption, very harsh! I can't wait to see everyone's works, one after another!

Our presentation was scheduled fourth from last, and by that time I was actually quite tired... However, it went very smoothly, and I managed to convey everything I wanted to say, so there isn't much to write about.

I listened carefully to almost all the projects and took notes on the ones I liked / found interesting / were impressive. I still can't compare to the diligence of my seniors; @ Junyu took detailed notes on each project. Since the organizers probably have some confidentiality considerations, I'll just write some abstract thoughts.

I saw several projects combining AI with tech for good, which I really liked. The recent boom in generative AI has made many friends worry about being replaced by AI in the future (especially lawyers, programmers, and investment researchers). Just yesterday I was discussing the technological shifts in human history with a friend. In fact, each one has been a liberation of humanity itself: in the short term some people are affected and lose their jobs, but it soon turns out that people have been freed from "less human jobs" to do "jobs more suited for humans." AI can help marketing accounts generate content, and it can also help visually impaired people interact with the world more seamlessly.

Many projects focused on travel planning. We considered this theme too, but realized we might have no data to work with. Dynamic flight pricing, hotel room rates, and even map routes are all constraints, and all of this data is controlled by OTA service providers with strict anti-scraping measures. So it would be easy to build a beautiful product that has no usable data. Since the start of the mobile internet era, data has been tightly held by large companies and trapped in isolated apps, leading users to form habits like "take a taxi with Didi," "watch videos on Douyin," and "search for boredom on Baidu." Yet all of these are essentially "information," not "video / text / voice / maps" or "notes / emails / schedules / TODOs." It's hard to say this isn't a "detour of the internet."

The theme that most teams at Hack Engine were interested in was actually "general knowledge management." GPT itself is a language model and lacks logical reasoning ability, while human knowledge actually exists in various logical relationships. The proposition "the Earth is round" is not important; the proposition "gravity causes the Earth's material to gather towards the center, thus forming an approximately spherical Earth" is important. Epistemology defines knowledge as Justified True Belief (JTB), meaning knowledge must meet the following three conditions:

Someone believes something;
This belief is actually true;
This belief is justified.

All three conditions are essential. Here are a few counterexamples: "Gravity causes the Earth's material to gather towards the center, thus the Earth becomes a bagel" (the belief is actually false); "Because there is a hamster running on a wheel inside the Earth, the Earth is round" (the justification for this belief is incorrect).

GPT stands for Generative Pre-Trained Transformer, which is a "large language model." The launch of ChatGPT seems more like a temporary move to capture user attention and data, and it doesn't necessarily mean that the best form/application of GPT is "conversational." Currently, everyone is doing conversation, and I feel a bit led astray by OpenAI. Additionally, as mentioned earlier, GPT lacks logical reasoning ability, so it's unwise to ask GPT knowledge-based questions. I'm sure everyone has seen GPT spouting nonsense with a straight face (hence GPT is also known as "nonsense machine").

On the other hand, I personally believe GPT is well suited to processing and generating content "under limited information." In Flaneur, for instance, all the source information is provided by us. Other examples: pre-filtering my Read it later list; generating a summary of an article (the TL;DR at the beginning of this article was written by GPT); automatically establishing associations in my knowledge base (actually, embeddings alone are enough for this); or generating a new article from fragments I have written.
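As a rough illustration of why embeddings alone can establish associations in a knowledge base, the sketch below ranks notes by cosine similarity between their vectors. The note titles and the 3-dimensional vectors are invented for the example; a real embedding model would return much longer vectors, and `related_notes` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_notes(query: str, vectors: dict[str, list[float]], top_k: int = 2) -> list[str]:
    """Rank the other notes by cosine similarity to the query note."""
    scores = {
        title: cosine(vectors[query], vec)
        for title, vec in vectors.items()
        if title != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy vectors standing in for real embeddings.
notes = {
    "walking in Shanghai": [0.9, 0.1, 0.0],
    "city soundwalks": [0.8, 0.2, 0.1],
    "hackathon logistics": [0.0, 0.1, 0.9],
}

print(related_notes("walking in Shanghai", notes, top_k=1))  # ['city soundwalks']
```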

I call these "general knowledge management," a theme I am very interested in and passionate about. Friends who are also interested in this theme are welcome to communicate!

1476px-Pieter_Bruegel_the_Elder_-The_Tower_of_Babel(Vienna)_-_Google_Art_Project.jpg

Co-creating the giant spirit of humanity

Ending#

The whole of Hack Engine was tightly scheduled: the demos finished on time, even ahead of schedule, and the results were announced on Monday night. Flaneur was not selected, which still feels a bit regrettable. However, we all enjoyed the process itself over these two days and spent an unforgettable weekend; the weather in Shanghai has been great these days, and although the forecast initially predicted rain, *** it turned out to be sunny!

Shanghai seems to always give me similar feelings: a beautiful beginning and process, leaving some regrets at the end.

Anyway, I am very grateful to my team, and extra thanks to @ York for coming from Hangzhou to participate (we actually forgot to take a group photo TAT)

A special thanks to the organizing team at Jike; we had an almost perfect experience on-site and encountered no issues at all, so it did not feel like their first time hosting such an event. Hack Engine also paid great attention to details, such as the specially designed participant certificates; this detail, wow.

More advanced than Byte's employee badges, wishing Jike Company a speedy acquisition by ByteDance.

Although Flaneur was not selected, it still received widespread appreciation, and many friends asked me if Flaneur would continue to be developed, which made us very happy. To be honest, we haven't figured it out yet; making a demo and creating a real product are quite different, and whether the existing technology can meet our expected effects still needs further research. Our team is also facing some difficulties, making it hard to spare energy and resources to create a new product.

Anyway, if you also like Flaneur, please don't hesitate to let us know with your compliments!