├── .gitignore
├── README.md
├── assets
│   ├── 1-dl.PNG
│   ├── 10-chat.PNG
│   ├── 11-gen.png
│   ├── 2-start.PNG
│   ├── 3-play.PNG
│   ├── 4-adjust.PNG
│   ├── 5-chat.PNG
│   ├── 6-chat.PNG
│   ├── 7-chat.PNG
│   ├── 8-chat.PNG
│   └── 9-chat.PNG
├── boot.sh
├── demovl.py
├── download.sh
├── environment.yml
├── generation
│   └── seine-v2
│       ├── configs
│       │   └── demo.yaml
│       ├── diffusion
│       │   ├── __init__.py
│       │   ├── diffusion_utils.py
│       │   ├── gaussian_diffusion.py
│       │   ├── respace.py
│       │   └── timestep_sampler.py
│       ├── functions
│       │   └── video_transforms.py
│       ├── models_new
│       │   ├── __init__.py
│       │   ├── attention.py
│       │   ├── clip.py
│       │   ├── resnet.py
│       │   ├── unet.py
│       │   └── unet_blocks.py
│       ├── requirements.txt
│       ├── seine.py
│       └── slurm_scripts
│           └── run_inference.sh
├── vinci-inference
│   ├── .env
│   ├── README.md
│   ├── app
│   │   ├── data.py
│   │   ├── exception
│   │   │   ├── __init__.py
│   │   │   └── handler.py
│   │   ├── global_var
│   │   │   ├── __init__.py
│   │   │   └── cache.py
│   │   ├── main.py
│   │   ├── models
│   │   │   ├── __init__.py
│   │   │   ├── internvl.py
│   │   │   └── seine.py
│   │   ├── service
│   │   │   ├── __init__.py
│   │   │   ├── internvl.py
│   │   │   └── seine.py
│   │   └── util
│   │       ├── __init__.py
│   │       ├── image.py
│   │       └── oss.py
│   ├── boot.sh
│   ├── client
│   │   ├── internvl.py
│   │   ├── internvl_sse.py
│   │   └── seine.py
│   ├── demo
│   │   └── demo.mp4
│   └── requirements
│       ├── app.txt
│       └── client.txt
├── vinci-local
│   ├── .gitignore
│   ├── README.md
│   └── docker
│       ├── README.md
│       ├── boot.sh
│       ├── clone.sh
│       ├── docker-compose-build.yaml
│       ├── docker-compose.yaml
│       ├── minio
│       │   └── entry.sh
│       ├── mysql
│       │   └── init.sql
│       ├── nginx
│       │   └── conf.d
│       │       └── default.conf
│       └── srs
│           └── conf
│               └── vinci.conf
└── vl_open.py
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

*.out
*.log
*.tar
*.jpg
*.png
Vinci-8B-ckpt/
Vinci-8B-base/
seine_weights/

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Vinci - An Online Egocentric Video-Language Assistant


> **Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model**
arXiv, 2024

## 💬 TL;DR

- **Overview**: A real-time, embodied smart assistant based on an egocentric vision-language model.
- **Portable Device Compatibility**: Designed for smartphones and wearable cameras, operating in an "always on" mode.
- **Hands-Free Interaction**: Users engage in natural conversations to ask questions and get responses delivered via audio.
- **Real-Time Video Processing**: Processes long video streams to answer queries about current and historical observations.
- **Task Planning and Guidance**: Provides task planning based on past interactions and generates visual task demonstrations.

## 📣 Demo video
[https://github.com/user-attachments/assets/ab019895-a7fe-4a1c-aa91-5a1e06dd4f2b](https://github.com/user-attachments/assets/ab019895-a7fe-4a1c-aa91-5a1e06dd4f2b)

[https://github.com/user-attachments/assets/6be2aa5c-81bb-4a85-b1cf-f08e30d97903](https://github.com/user-attachments/assets/6be2aa5c-81bb-4a85-b1cf-f08e30d97903)

## 🔨 Installation
```bash
git clone https://github.com/OpenGVLab/vinci.git
conda env create -f environment.yml
```
Requirements:
- Python 3.8 or above
- PyTorch 2.0 or above is recommended
- CUDA 11.4 or above is recommended
- Docker is required when deploying the streaming demo
- Gradio is required when using the local web-based demo


### Downloading Checkpoints
```bash
bash download.sh
```
Running `download.sh` will take up more than 100 GB of disk space.

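Since the checkpoints occupy over 100 GB, a quick pre-flight check can save an aborted download. The snippet below is an illustrative sketch (not part of the repo; the 100 GB figure comes from the note above) using only the Python standard library:

```python
import shutil

REQUIRED_GB = 100  # the downloaded checkpoints take up >100 GB of disk space


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if __name__ == "__main__":
    print("Enough space for checkpoints:", has_enough_space("."))
```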
## 🎓 Getting Started
We offer two ways to run our Vinci model.

### 🎬 Online Streaming Demo
1. Start the frontend, backend, and model services:
```bash
sudo ./boot.sh {start|stop|restart} [--cuda ] [--language chn/eng] [--version v0/v1]
```

- `--cuda`: Specify the GPU devices to run the model.
- `--language`: Choose the language for the demo (default: chn).
  - chn: Chinese
  - eng: English
- `--version`: Select the model version (default: v1).
  - v0: Optimized for first-person perspective videos.
  - v1: Generalized model for both first-person and third-person perspective videos.

Then open the frontend page in a browser: http://YOUR_IP_ADDRESS:19333 (e.g., http://102.2.52.16:19333)

2. Push a live stream
With a smartphone app or a GoPro/DJI camera, push the stream to: `rtmp://YOUR_IP_ADDRESS/vinci/livestream`

With a webcam, use the following command: `ffmpeg -f video4linux2 -framerate 30 -video_size 1280x720 -i /dev/video1 -f alsa -i default -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -c:a aac -threads 0 -f flv rtmp://YOUR_IP_ADDRESS:1935/vinci/livestream`

#### Interact with the Online Video Streaming Demo
1. Activate the model service: To wake up the model and begin using it, say the wake-up phrase "你好望舒 (Ni hao wang shu)". (Currently, only the Chinese wake-up command is supported.)
2. Chat with Vinci: Once activated, you can chat with Vinci by speech. The model responds in text and speech.
Tip: For the best experience, speak clearly and at a moderate pace.
3. Generate predictive visualizations: To generate a predictive visualization of actions, include the keyword "可视化 (Ke shi hua)" in your command.

### 🎬 Gradio Demo for Uploaded Videos
```bash
python demovl.py [--cuda ] [--language chn/eng] [--version v0/v1]
```
- `--cuda`: Specify the GPU devices to run the model.
- `--language`: Choose the language for the demo (default: chn).
  - chn: Chinese
  - eng: English
- `--version`: Select the model version (default: v1).
  - v0: Optimized for first-person perspective videos.
  - v1: Generalized model for both first-person and third-person perspective videos.

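The flags above can be mirrored with a minimal `argparse` setup. This is an illustrative sketch of the documented interface (including the documented defaults), not the actual parser inside `demovl.py`:

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented demo flags."""
    parser = argparse.ArgumentParser(description="Vinci Gradio demo (sketch)")
    parser.add_argument("--cuda", default="0",
                        help="GPU device(s) to run the model on")
    parser.add_argument("--language", choices=["chn", "eng"], default="chn",
                        help="demo language (default: chn)")
    parser.add_argument("--version", choices=["v0", "v1"], default="v1",
                        help="model version (default: v1)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--language", "eng"])
    print(args.language, args.version)  # eng v1
```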
#### Interact with the Gradio Demo
1. Upload a local video file.
2. Click the "Upload & Start Chat" button to initiate the chat session.
3. Click the play button to start playing the video.
4. Adjust the stride of memory. This allows you to control the granularity of the model's memory.
5. Real-time interaction: Type your questions in the chat box. The model responds based on the current frame and the historical context. Example capabilities (screenshots in `assets/`):
   - Describe the current action
   - Retrieve an object from the history
   - Summarize previous actions
   - Scene understanding
   - Temporal grounding
   - Predict future actions
6. Generate future videos: Based on the current frame and the historical context, the model can generate a short future video.
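The effect of the memory stride (step 4 above) can be pictured as simple frame subsampling. The snippet below is a hypothetical illustration, not the model's actual memory mechanism: a larger stride keeps fewer, more widely spaced frames, i.e. a coarser memory over the same video.

```python
def memory_frames(total_frames, stride):
    """Indices of the frames kept in memory for a given stride (illustrative)."""
    return list(range(0, total_frames, stride))


# A larger stride means a coarser memory over the same 12-frame clip:
print(memory_frames(12, 2))  # [0, 2, 4, 6, 8, 10]
print(memory_frames(12, 4))  # [0, 4, 8]
```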
171 |