On The Watch #1: About LLMs Browsing on Your Phone, Unit Tests and Code Reviews
Including a short foreword about my plans for this newsletter
Foreword:
If Python code makes your head hurt or you simply don’t understand it yet, feel free to skip to the text below. It’s just a fun little gimmick and completely optional.
from arxiv import llm_papers
from datetime import date

today = date.today()
current_week = today.isocalendar().week

# A newsletter class focusing on LLM research
class LLMWatch:
    # Initializing the newsletter
    def __init__(self):
        self.welcome_msg = "Well met, watchers!"
        self.article_types = ["research", "applied", "implementation"]
        self.weekly_papers = llm_papers(current_week)

    # Welcoming all subscribers
    def say_hi(self):
        print(self.welcome_msg)

    # Selecting an article type
    def select_type(self):
        ...
As you can see from the code, the way this newsletter will work isn’t fully defined yet. Just as with software development - or developing any product - this will be an iterative process. Your feedback is highly welcome, as we’re setting out on this journey together!
The vision I currently have for this publication is to focus on three areas: Research, Application, and Implementation. Let’s take a look at what I mean by that.
Research: This will form the base of the newsletter. It’s what you’re reading today, and I plan to release these updates every week.
Application: Application articles will go a little deeper. I’ll choose one of the research topics from the newsletter and apply it to real-world use cases. I want these articles to be as practically relevant as possible while using the latest advancements from academia. For this, I’ll use existing implementations - the goal will be to showcase how to apply and leverage what is already there.
Implementation: Implementations will be very time-intensive, so I’ll probably only create this type of article when something new comes out that looks really promising, is reasonable to implement, and for which the researchers didn’t provide code.
Obviously, we all love applied original content. But it’s also the most costly to create, and I’ll have to see how frequently I’ll be able to provide such articles. At the time of writing, my Substack is completely free, and the Research part always will be. For the other two areas, I’ll try to strike a balance between free and paid. Before I turn on subscriptions, though, I want the content to speak for itself first. I won’t take any subscriptions for the first two months.
Thank you all for reading this and for your support,
Let’s kick it off!
In this issue:
LLMs taking control of your smartphone
Never writing unit tests again (I wish)
Training LLMs to look at your code and say “LGTM!”
1. Empowering LLMs to use Smartphones for Intelligent Task Automation
Watching: AutoDroid (paper/code)
What problem does it solve? Traditional bot software is rather limited: rule-based systems are hard to scale and often lead to behavior that will quickly get you flagged as a bad actor - even if you aren’t doing anything bad - because any excessive non-organic usage pattern will look shady to app owners.
How does it solve the problem? AutoDroid tries to solve smartphone task automation with the power of LLMs. It supports cloud-based user favorites, such as GPT-3.5 and GPT-4, as well as local 7B models that can be finetuned to work with specific apps. App data is fed into the LLM’s context to simulate memory. Just like in a chat, the model enters a dialogue with the app - except that this “dialogue” is more complex and consists of several intermediary steps; think of two people who can’t communicate directly. The LLM acts as a guide for the Task Executor, and the app sends back feedback after every action.
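To make that loop a bit more concrete, here’s a minimal Python sketch of such a guided dialogue: the app state goes into the prompt, the model suggests the next action, and the app’s feedback is appended to the memory. The helper functions are hypothetical placeholders for illustration, not AutoDroid’s actual API.

# Minimal sketch of the guided dialogue loop described above.
# `query_llm`, `get_app_state`, and `execute_action` are hypothetical
# placeholders, not AutoDroid's actual API.

def query_llm(prompt: str) -> str:
    """Call your LLM of choice (e.g., GPT-4 or a local 7B model)."""
    raise NotImplementedError

def get_app_state() -> str:
    """Return a text description of the current app screen."""
    raise NotImplementedError

def execute_action(action: str) -> str:
    """Perform the suggested action and return the app's feedback."""
    raise NotImplementedError

def automate_task(task: str, max_steps: int = 10) -> None:
    memory = []  # app data fed into the context to simulate memory
    for _ in range(max_steps):
        state = get_app_state()
        prompt = (
            f"Task: {task}\n"
            f"Current screen: {state}\n"
            f"History so far: {memory}\n"
            "Which UI action should be taken next? Reply DONE if finished."
        )
        action = query_llm(prompt)
        if action.strip() == "DONE":
            break
        feedback = execute_action(action)  # the app answers after every action
        memory.append((action, feedback))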
What’s next? There’s certainly still a high barrier to entry to take advantage of this technology, and the failure rate is non-trivial. But I’m excited to see smartphone automation making big strides, and personally, I’m looking forward to automating some things on my phone.
2. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
Watching: TestPilot (paper)
What problem does it solve? Coding LLMs have been all the rage lately, and reasonably so! But with coding assistants evolving, there’s a demand for assistants that can handle additional tasks, such as debugging and designing tests. When it comes to unit testing, current automated test generation suffers from a lack of readability (e.g., due to crude variable naming) and a lack of meaningful assertions.
How does it solve the problem? Most previous methods relied on conventional techniques based on symbolic execution, evolutionary methods, or search. LLMs are trained on human or human-like instructions, so they excel at mimicking natural language and code. This mitigates the two main problems mentioned above: the lack of readability and of meaningful assertions.
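For a rough idea of what prompt-based test generation can look like, here’s a small Python sketch: the function’s source, docstring, and usage examples are packed into a prompt, and the model is asked for a pytest test. This is loosely inspired by the idea above, not TestPilot’s actual pipeline, and query_llm is a hypothetical placeholder.

# A minimal sketch of prompt-based unit test generation.
# `query_llm` is a hypothetical placeholder for whichever model you use;
# this is not TestPilot's actual pipeline.

def query_llm(prompt: str) -> str:
    """Call an LLM (e.g., GPT-3.5/4) and return its completion."""
    raise NotImplementedError

def build_test_prompt(source: str, docstring: str, examples: str) -> str:
    # Packing the function's source, documentation and usage examples into
    # the prompt nudges the model toward readable, meaningful assertions.
    return (
        "Write a pytest unit test for the following function.\n"
        f"Source:\n{source}\n"
        f"Docstring:\n{docstring}\n"
        f"Example usage:\n{examples}\n"
        "Use descriptive variable names and assert on the expected values."
    )

def generate_unit_test(source: str, docstring: str, examples: str) -> str:
    prompt = build_test_prompt(source, docstring, examples)
    return query_llm(prompt)  # write the result to a test_*.py file and run it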
What’s next? Tests are often seen as binary - either they pass or they fail. But there’s more to it from a developer’s point of view. What if a test fails because it’s the wrong test for your code? Which generated tests are useful and only need a little fixing? Which ones are simply bad generations? A more fine-grained evaluation method will be crucial to further improve the user experience.
3. LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning
Watching: LLaMA-Reviewer (paper)
What problem does it solve? While code reviewing can be an effective way to learn collaborative coding, it’s also often perceived as tedious. Current SOTA methods, such as CodeReviewer, are based on pre-trained Transformer models that take up a lot of storage (~850MB) and parameters (220M). In the era of GPT-4 this might not seem like much, but keep in mind that for coding, we’d ideally want to finetune the model for each of our code bases.
How does it solve the problem? LLaMA itself is way too big with its ~7B parameters. But luckily, Parameter-Efficient Fine-Tuning (PEFT) is getting better and better. For LLaMA-Reviewer, the researchers explored Prefix-Tuning (PT) and LoRA. The latter performed significantly better - on par with the current SOTA while using 26x fewer parameters and 50x less storage space.
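If you want a feel for how little code LoRA requires, here’s a rough sketch using Hugging Face’s peft library. The base checkpoint and hyperparameters are placeholders, not the exact configuration from the LLaMA-Reviewer paper.

# Rough sketch of attaching LoRA adapters to a causal LM with Hugging Face's
# `peft` library. Checkpoint and hyperparameters are illustrative placeholders,
# not the setup used by LLaMA-Reviewer.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # example checkpoint

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trained
# From here on, train as usual on (code diff, review comment) pairs.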
What’s next? As this was only done with the smallest version of LLaMA and before LLaMA-2 even existed, there’s still a lot of room for quick improvements. This might be a good time to develop consumer-grade code review software? Sounds like an awesome VSCode plugin (or Cursor, if you’re feeling hip).