├── .dockerignore ├── .flake8 ├── .gitignore ├── Dockerfile ├── README.md ├── app.py ├── config.py ├── crawler ├── __init__.py └── music_crawler.py ├── data └── artist_list.txt ├── format_data.py ├── masks └── romano.png ├── model.py ├── model ├── __init__.py ├── input_pipeline.py ├── rnn.py ├── sample_generator.py ├── song_generator.py └── song_model.py ├── preprocessing ├── __init__.py ├── dataset.py ├── text_preprocessing.py └── tfrecord.py ├── requirements.txt ├── sample.py ├── scripts ├── create_sample.sh ├── create_wordcloud_graph.sh ├── download_musics.sh ├── run_format_data.sh ├── run_model.sh └── run_vagalume_crawler.sh ├── song_word_cloud.py ├── utils ├── __init__.py ├── progress_bar.py └── session_manager.py ├── vagalume_crawler.py └── vagalume_downloader.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .git 2 | .dockerignore 3 | __pycache__/* 4 | scripts/* 5 | crawler/* 6 | data/* 7 | !data/song_dataset/word2index.pkl 8 | !data/song_dataset/index2word.pkl 9 | proibidao/* 10 | !proibidao/song_dataset/word2index.pkl 11 | !proibidao/song_dataset/index2word.pkl 12 | kondzilla/* 13 | !kondzilla/song_dataset/word2index.pkl 14 | !kondzilla/song_dataset/index2word.pkl 15 | ostentacao/* 16 | !ostentacao/song_dataset/word2index.pkl 17 | !ostentacao/song_dataset/index2word.pkl 18 | preprocessing/* 19 | vagalume_crawler.py 20 | vagalume_downloader.py 21 | model.py 22 | sample.py 23 | 24 | -------------------------------------------------------------------------------- /.flake8: -------------------------------------------------------------------------------- 1 | [flake8] 2 | exclude = .git,__pycache__,data/,scripts/ 3 | max-line-length=100 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | data/* 3 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM debian:testing 2 | 3 | RUN apt-get update -qy && apt-get install python3-pip -qy 4 | RUN pip3 install tensorflow==1.4.1 flask flask-cors gunicorn 5 | 6 | RUN mkdir -p /funk-generator/ 7 | ADD . /funk-generator/ 8 | WORKDIR /funk-generator/ 9 | 10 | EXPOSE 5000 11 | 12 | CMD ["gunicorn", "-b", "0.0.0.0:5000", "app:app"] 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Funk Generator 2 | =================== 3 | 4 | Decidi criar este projeto quando estava aprendendo sobre como modelos de linguagem usando Deep 5 | Learning funcionam. Escrevi mais detalhadamente sobre este projeto neste 6 | [post](https://medium.com/@lucasmoura_35920/mc-neural-o-funkeiro-artificial-ab6fbedc9771) no Medium. 7 | Além disso, fiz também uma descrição mais técnica do projeto em um 8 | [post](http://lmoura.me/blog/2018/05/07/funk-generator/) no meu blog. Por fim, o programa está 9 | rodando em uma [página](http://lmoura.me/funk_generator/) do meu blog. Lá você pode gerar músicas em 10 | tempo real. 11 | 12 | Neste documento, irei mostrar como fazer para que você consiga rodar o projeto do zero, ou até usar 13 | outras músicas para treinar o seu modelo. 
14 |
15 | Dependências do projeto
16 | ---------
17 |
18 | Antes de tudo, instale as dependências necessárias para executar este projeto:
19 |
20 | ```sh
21 | $ pip install -r requirements.txt
22 | ```
23 |
24 | Coleta das músicas
25 | ----------
26 |
27 | A lista dos artistas que usei pode ser encontrada na pasta *data*, com o nome de *artist_list.txt*.
28 | Esse arquivo é usado para fazer um crawler na API do [vagalume](https://api.vagalume.com.br/).
29 |
30 | Para começar o crawler, execute o seguinte comando:
31 |
32 | ```sh
33 | $ ./scripts/run_vagalume_crawler.sh
34 | ```
35 |
36 | Este script irá passar por todos os artistas da lista e irá criar um diretório para cada artista na
37 | pasta *data*. Dentro de cada diretório, serão criados dois arquivos distintos:
38 |
39 | * *song_codes.txt*: Um arquivo contendo o código de todas as músicas do artista. Esse código é usado
40 | para fazer o download da música em si, usando a API do vagalume.
41 | * *song_names.txt*: Um arquivo contendo o nome das músicas do artista.
42 |
43 | Uma vez que este script tenha sido executado, pode-se então executar o seguinte script:
44 |
45 | ```sh
46 | $ ./scripts/download_musics.sh
47 | ```
48 |
49 | Este script vai entrar em cada diretório que representa um artista e baixar todas as músicas
50 | presentes no *song_codes.txt* daquele diretório. Cada música é armazenada como um arquivo txt.
51 |
52 |
53 | Formatar os dados
54 | ------------------
55 |
56 | Após o download das músicas, é necessário converter os arquivos de texto para um formato que o
57 | modelo entenda. Para isso, rode o seguinte script:
58 |
59 | ```sh
60 | $ ./scripts/run_format_data.sh
61 | ```
62 |
63 | Esse script irá criar uma pasta chamada *song_dataset* dentro da pasta *data*. Dentro dessa pasta,
64 | estarão os arquivos já processados para o treinamento do modelo.
65 |
66 | *OBS: Nesse meu projeto eu criei quatro modelos diferentes para tipos diferentes de funk (Kondzilla,
67 | Proibidão, Ostentação e todas as músicas). Entretanto, essa separação foi feita manualmente. Eu tive
68 | que decidir quais músicas eram só de Ostentação, por exemplo. Para isso, olhei músicas que tinham
69 | certos termos característicos e as agrupei em uma pasta diferente. Ou seja, aqui essa separação não
70 | será feita de forma automática. Se você quiser treinar os 4 modelos como eu fiz, terá que fazer esta
71 | etapa manualmente.*
72 |
73 | Treinamento do modelo
74 | ----------------------
75 |
76 | Uma vez com os dados formatados, basta rodar o seguinte script:
77 |
78 | ```sh
79 | $ ./scripts/run_model.sh
80 | ```
81 |
82 | Após o treinamento ser concluído, será criado um diretório chamado *checkpoint* na raiz do projeto.
83 | Esse diretório contém o modelo treinado. Caso queira continuar treinando esse modelo gerado, altere a
84 | variável *USE_CHECKPOINT* no script *run_model.sh*.
85 |
86 |
87 | Gerar Músicas
88 | --------------
89 |
90 | Para testar o modelo e gerar algumas músicas, rode o seguinte script:
91 |
92 | ```sh
93 | $ ./scripts/create_sample.sh
94 | ```
95 |
96 | Esse script irá gerar uma música por vez.
97 |
98 |
99 | Criar API
100 | ---------------------
101 |
102 | Ao final, rode o seguinte comando para subir a API do projeto:
103 |
104 | ```sh
105 | $ python app.py
106 | ```
107 |
108 | A aplicação rodará com o servidor default do Flask e permite que você teste o modelo por requisições
109 | POST. Lembre-se de que, se você só treinar um modelo, a API só reconhecerá o modelo com id 1. Logo,
110 | lembre-se de sempre setar o id da requisição POST como 1. Por default, o programa será executado na
111 | porta 5000.
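Por exemplo, com o servidor rodando localmente, uma requisição de teste pode ser feita com o *curl* (o valor de *sentence* abaixo é apenas um exemplo de verso inicial; use o texto que quiser):

```sh
$ curl -X POST http://localhost:5000/api \
    -H "Content-Type: application/json" \
    -d '{"id": "1", "sentence": "hoje tem baile"}'
```

A resposta é a letra gerada, retornada como uma string JSON com as quebras de linha já em formato HTML.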
112 |
113 | Além disso, como em produção eu queria que a geração das músicas fosse o mais rápido possível, eu
114 | gerei 4 mil músicas e as armazenei dentro da minha aplicação (essas músicas não estão presentes
115 | neste repositório). Sendo assim, a sua requisição POST não pode ter a variável *sentence* vazia,
116 | pois, caso ela esteja vazia, a aplicação vai tentar pegar aleatoriamente uma das músicas já armazenadas.
117 |
118 | Dessa forma, o modelo só gera músicas em tempo real se o valor da variável *sentence* não for vazio.
119 |
120 | Recomendo que, se você quiser usar esse modelo em produção como eu fiz, use outra aplicação para
121 | fazer o servidor, como o [gunicorn](http://gunicorn.org/).
122 |
123 |
124 | Container
125 | --------------
126 |
127 | Caso você queira apenas usar a aplicação sem criar o modelo do zero, pode usar o container que
128 | criei. Para isso é necessário ter o [Docker](https://www.docker.com/) instalado.
129 |
130 | Uma vez com ele instalado, execute o seguinte comando para baixar a imagem do container:
131 |
132 | ```sh
133 | $ docker pull lucasmoura/funk-generator
134 | ```
135 |
136 | E para executar este container:
137 |
138 | ```sh
139 | $ docker run -d -p 5000:5000 lucasmoura/funk-generator
140 | ```
141 |
142 | Com o container rodando, basta seguir os passos descritos na seção *Criar API* para usar o programa.
143 | Entretanto, o container tem uma vantagem: nele existem todos os 4 modelos de funk que criei e também
144 | nele estão presentes as 4 mil músicas já geradas. Dessa forma, você não está restrito a sempre deixar a
145 | variável *id* como 1, e também pode deixar a variável *sentence* vazia, caso queira recuperar
146 | algumas das músicas já criadas.
147 |
148 | Gere seu próprio modelo com suas próprias músicas
149 | ---------------
150 |
151 | Para gerar seu próprio modelo, basta que você mude o arquivo *artist_list.txt* para conter os
152 | artistas que você quiser e depois é só seguir todos os passos já listados.
153 |
154 | Caso você já tenha as músicas baixadas, garanta que cada artista tem um diretório próprio e que
155 | todas as músicas desse artista estejam no diretório que o representa.
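Por exemplo, a estrutura esperada é parecida com a que o crawler gera (os nomes de artistas e de músicas abaixo são apenas ilustrativos):

```
data/
├── mc-exemplo/
│   ├── primeira-musica.txt
│   └── segunda-musica.txt
└── outro-artista/
    └── alguma-musica.txt
```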
Uma vez isso pronto, basta 156 | continuar à partir da seção *Formatar dados* 157 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import random 3 | 4 | from flask import Flask, request, jsonify 5 | from flask_cors import CORS 6 | 7 | from model.sample_generator import create_sample 8 | from config import all_args, kondzilla_args, proibidao_args, ostentacao_args 9 | 10 | 11 | app = Flask(__name__) 12 | CORS(app) 13 | 14 | 15 | # load the model 16 | def load(path): 17 | with open(path, 'rb') as f: 18 | return pickle.load(f) 19 | 20 | 21 | all_songs = load('generated_songs/generated-all-songs.pkl') 22 | kondzilla_songs = load('generated_songs/generated-kondzilla-songs.pkl') 23 | proibidao_songs = load('generated_songs/generated-proibidao-songs.pkl') 24 | ostentacao_songs = load('generated_songs/generated-ostentacao-songs.pkl') 25 | 26 | all_sampler = create_sample(all_args) 27 | kondzilla_sampler = create_sample(kondzilla_args) 28 | proibidao_sampler = create_sample(proibidao_args) 29 | ostentacao_sampler = create_sample(ostentacao_args) 30 | 31 | 32 | def get_song(model_id): 33 | random_num = random.randint(0, len(all_songs) - 1) 34 | 35 | if model_id == 1: 36 | return all_songs[random_num][1:].replace('\n', '
') 37 | elif model_id == 2: 38 | return kondzilla_songs[random_num][1:].replace('\n', '
') 39 | elif model_id == 3: 40 | return proibidao_songs[random_num][1:].replace('\n', '
') 41 | else: 42 | return ostentacao_songs[random_num][1:].replace('\n', '
') 43 | 44 | 45 | # API route 46 | @app.route('/api', methods=['POST']) 47 | def api(): 48 | """API function 49 | 50 | All model-specific logic to be defined in the get_model_api() 51 | function 52 | """ 53 | data = request.json 54 | model_id = int(data['id']) 55 | prime_words = data['sentence'] 56 | 57 | if prime_words == '': 58 | output_data = get_song(model_id) 59 | else: 60 | output_data = -1 61 | if model_id == 1: 62 | while output_data == -1: 63 | output_data = all_sampler(prime_words, html=True) 64 | elif model_id == 2: 65 | while output_data == -1: 66 | output_data = kondzilla_sampler(prime_words, html=True) 67 | elif model_id == 3: 68 | while output_data == -1: 69 | output_data = proibidao_sampler(prime_words, html=True) 70 | else: 71 | while output_data == -1: 72 | output_data = ostentacao_sampler(prime_words, html=True) 73 | 74 | return jsonify(output_data) 75 | 76 | 77 | @app.route('/') 78 | def index(): 79 | return "Index API" 80 | 81 | 82 | # HTTP Errors handlers 83 | @app.errorhandler(404) 84 | def url_error(e): 85 | return """ 86 | Wrong URL! 87 |
{}
""".format(e), 404 88 | 89 | 90 | @app.errorhandler(500) 91 | def server_error(e): 92 | return """ 93 | An internal error occurred:
{}
94 | See logs for full stacktrace. 95 | """.format(e), 500 96 | 97 | 98 | if __name__ == '__main__': 99 | # This is used when running locally. 100 | app.run(host='0.0.0.0', debug=True) 101 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | 3 | 4 | default_args = { 5 | 'use_checkpoint': True, 6 | 'embedding_size': 300, 7 | 'num_layers': 3, 8 | 'num_units': 728 9 | } 10 | 11 | all_args = { 12 | 'checkpoint_path': 'checkpoint', 13 | 'index2word_path': 'data/song_dataset/index2word.pkl', 14 | 'word2index_path': 'data/song_dataset/word2index.pkl', 15 | 'vocab_size': 12551, 16 | } 17 | all_args = {**default_args, **all_args} 18 | all_args = defaultdict(int, all_args) 19 | 20 | kondzilla_args = { 21 | 'checkpoint_path': 'kondzilla_checkpoint', 22 | 'index2word_path': 'kondzilla/song_dataset/index2word.pkl', 23 | 'word2index_path': 'kondzilla/song_dataset/word2index.pkl', 24 | 'vocab_size': 2192, 25 | } 26 | kondzilla_args = {**default_args, **kondzilla_args} 27 | kondzilla_args = defaultdict(int, kondzilla_args) 28 | 29 | proibidao_args = { 30 | 'checkpoint_path': 'proibidao_checkpoint', 31 | 'index2word_path': 'proibidao/song_dataset/index2word.pkl', 32 | 'word2index_path': 'proibidao/song_dataset/word2index.pkl', 33 | 'vocab_size': 1445, 34 | } 35 | proibidao_args = {**default_args, **proibidao_args} 36 | proibidao_args = defaultdict(int, proibidao_args) 37 | 38 | ostentacao_args = { 39 | 'checkpoint_path': 'ostentacao_checkpoint', 40 | 'index2word_path': 'ostentacao/song_dataset/index2word.pkl', 41 | 'word2index_path': 'ostentacao/song_dataset/word2index.pkl', 42 | 'vocab_size': 2035, 43 | } 44 | ostentacao_args = {**default_args, **ostentacao_args} 45 | ostentacao_args = defaultdict(int, ostentacao_args) 46 | -------------------------------------------------------------------------------- /crawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/crawler/__init__.py -------------------------------------------------------------------------------- /crawler/music_crawler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import requests 4 | import unicodedata 5 | 6 | from bs4 import BeautifulSoup 7 | from pathlib import Path 8 | 9 | 10 | def remove_accented_characters(name): 11 | name = unicodedata.normalize('NFD', name).encode('ascii', 'ignore') 12 | return name.decode('ascii') 13 | 14 | 15 | def clean_name(name): 16 | name = name.strip() 17 | name = name.lower() 18 | name = name.replace(' ', '-') 19 | name = name.replace('/', '') 20 | name = remove_accented_characters(name) 21 | 22 | return name 23 | 24 | 25 | class MusicCrawler: 26 | 27 | def __init__(self, artist_list_path, data_folder): 28 | self.artist_list_path = artist_list_path 29 | self.data_folder = Path(data_folder) 30 | self.artists = None 31 | 32 | self.vagalume_url = 'https://www.vagalume.com.br/' 33 | 34 | def remove_accented_characters(self, artist_name): 35 | artist_name = unicodedata.normalize('NFD', artist_name).encode('ascii', 'ignore') 36 | return artist_name.decode('ascii') 37 | 38 | def parse_artist_name(self, artist_name): 39 | return clean_name(artist_name) 40 | 41 | def load_artists(self): 42 | with open(self.artist_list_path, 'r') as 
artist_file: 43 | artists = artist_file.readlines() 44 | 45 | self.artists = [self.parse_artist_name(artist) for artist in artists] 46 | 47 | def find_tracks_list(self, html_page): 48 | songs = [] 49 | 50 | tracks = html_page.find('ul', {'class': 'tracks'}) 51 | tracks_hrefs = tracks.findAll('a', href=True) 52 | 53 | for track_href in tracks_hrefs: 54 | name = track_href.get_text() 55 | code = track_href.get('data-song') 56 | 57 | if not code or not name: 58 | continue 59 | 60 | songs.append((name, code)) 61 | 62 | return songs 63 | 64 | def get_artist_songs(self, artist_name): 65 | artist_url = self.vagalume_url + artist_name 66 | 67 | response = requests.get(artist_url) 68 | parsed_response = BeautifulSoup(response.content, 'html.parser') 69 | 70 | artist_songs = self.find_tracks_list(parsed_response) 71 | 72 | return artist_songs 73 | 74 | def save_data(self, save_path, save_list): 75 | with save_path.open(mode='w') as f: 76 | for name in save_list: 77 | f.write(name + '\n') 78 | 79 | def save_artist_songs_info(self, artist, artist_songs): 80 | song_names = [name for name, code in artist_songs] 81 | song_codes = [code for name, code in artist_songs] 82 | 83 | save_path = self.data_folder / artist 84 | 85 | if not save_path.exists(): 86 | save_path.mkdir() 87 | 88 | save_path_names = save_path / 'song_names.txt' 89 | save_path_codes = save_path / 'song_codes.txt' 90 | 91 | self.save_data(save_path_names, song_names) 92 | self.save_data(save_path_codes, song_codes) 93 | 94 | def crawl_musics(self): 95 | self.load_artists() 96 | 97 | for artist in self.artists: 98 | print('Getting songs of {}...'.format(artist)) 99 | 100 | artist_songs = self.get_artist_songs(artist) 101 | self.save_artist_songs_info(artist, artist_songs) 102 | 103 | 104 | class MusicDownloader: 105 | 106 | def __init__(self, key_file_path, data_folder, code_file_name): 107 | self.key_file_path = key_file_path 108 | self.data_folder = Path(data_folder) 109 | self.code_file_name = code_file_name 110 | 111 | self.api_url = 'https://api.vagalume.com.br/search.php?musid={}&apikey{}' 112 | 113 | def load_api_key(self): 114 | with open(self.key_file_path, 'r') as key_file: 115 | self.api_key = key_file.read().strip() 116 | 117 | def load_codes(self, codes_path): 118 | with codes_path.open() as code_file: 119 | codes = code_file.readlines() 120 | codes = [code.strip() for code in codes] 121 | 122 | return codes 123 | 124 | def make_request(self, code): 125 | return requests.get(self.api_url.format(code, self.api_key)) 126 | 127 | def save_songs(self, songs, artist_path): 128 | for song_name, song in songs: 129 | song_name += '.txt' 130 | song_path = artist_path / song_name 131 | 132 | with song_path.open(mode='w') as song_file: 133 | song_file.write(song) 134 | 135 | def download_songs(self, artist_name): 136 | codes_path = self.data_folder / artist_name / self.code_file_name 137 | codes = self.load_codes(codes_path) 138 | songs = [] 139 | 140 | for code in codes: 141 | self.make_request(code) 142 | json_response = self.make_request(code) 143 | 144 | song_name, song = self.parse_json_response(json_response) 145 | songs.append((song_name, song)) 146 | time.sleep(3) 147 | 148 | artist_path = self.data_folder / artist_name 149 | self.save_songs(songs, artist_path) 150 | 151 | def clean_song_name(self, song_name): 152 | return clean_name(song_name) 153 | 154 | def parse_json_response(self, json_response): 155 | json_dict = json_response.json() 156 | 157 | song = json_dict['mus'][0]['text'] 158 | song_name = 
json_dict['mus'][0]['name'] 159 | 160 | return self.clean_song_name(song_name), song 161 | 162 | def download_all_songs(self): 163 | self.load_api_key() 164 | 165 | for artist in os.listdir(str(self.data_folder)): 166 | 167 | artist_path = self.data_folder / artist 168 | if not artist_path.is_dir(): 169 | continue 170 | 171 | print('Downloading songs from {}'.format(artist)) 172 | self.download_songs(artist) 173 | -------------------------------------------------------------------------------- /data/artist_list.txt: -------------------------------------------------------------------------------- 1 | 3zi 2 | Allycia 3 | Amaral Mc 4 | Amilcka e Chocolate 5 | Andrezinho Shock 6 | Androw 7 | Anitta 8 | Antunes 9 | Arquiteto do Amor 10 | As Danadinhas 11 | As Experimentas 12 | As Tchutchucas 13 | As Tequileiras do Funk 14 | Babi 15 | Backdi e Bio G3 16 | Backdi e Bio-g3 17 | Banda Aerosol 18 | Berti 19 | Biel 20 | Bob Rum 21 | Bola de Fogo 22 | Bonde Da Madrugada 23 | Bonde Das Gramadas 24 | Bonde Das Maravilhas 25 | Bonde Do Macaco 26 | Bonde Do Ratão 27 | Bonde Life Stronda 28 | Bonde Nervoso 29 | Bonde Neurose 30 | Bonde R300 31 | Bonde Tesão 32 | Bonde da Oskley 33 | Bonde das Dancinhas 34 | Bonde das Impostora 35 | Bonde do Canguru 36 | Bonde do Come Quieto 37 | Bonde do Mainstream 38 | Bonde do Rolê 39 | Bonde do Tigrão 40 | Bonde do Vinho 41 | Bonde dos Perversos 42 | Bonde dos magrinhos 43 | Bruno Moreira 44 | Buchecha 45 | Bó do Catarina 46 | Cabeçudos 47 | Careca e Pixote 48 | Caroline Miranda 49 | Casa de Farinha 50 | Chiquinho e Amaral 51 | Ciclone 52 | Cidinho e Doca 53 | Claudinho 54 | Claudinho Buchecha 55 | Creidi 56 | Cru 57 | Cubanu´s 58 | DJ Cris 59 | DJ Malboro 60 | DJ R7 61 | DJ San e Dieguinho G 62 | DJ Thiago 63 | Danda E Taffarel 64 | Dandara 65 | Dani Brinks 66 | Dani Russo 67 | Danilo e Fabinho 68 | David Bolado 69 | Deize Tigrona 70 | Dennis Dj 71 | Dinho da Vp 72 | Diogo Martins 73 | Dj DANIEL 74 | Dj Dennis 75 | Dj Filé 76 | Dj Marlboro 77 | Dj Tiago 78 | Dr. 
Bacalhau 79 | Dream Team do Passinho 80 | Duda do Marapé 81 | Ed Pupone 82 | Edu Gueda 83 | Edy Lemond 84 | Efeito Contrário 85 | Estação Zero 86 | Fabão Brazil 87 | Fagner Pinheiro 88 | Formiga DJ 89 | Furacão 2000 90 | Fábio Sina 91 | Fábrica da Arte 92 | Gaiola Das Popozudas 93 | Gaab 94 | Gorila e Preto 95 | Grafitte 07 96 | Igor Almeida 97 | Jah Mai 98 | Jaqueline Cindy 99 | Jaula Das Gostosudas 100 | Jerry Smith 101 | Jess 102 | Jherry 103 | Jojo Maronttinni 104 | Jonathan Costa 105 | Juliana e As Fogosas 106 | Junior Launther 107 | Justiceiras Do Funk 108 | Karinna Spencer 109 | Kula 110 | Labanca 111 | Lady Fortunato 112 | Larica Dos Mulekes 113 | Leandro e As Abusadas 114 | Lenny B 115 | Leo Kiss 116 | Leo Sannttos 117 | Lippe 118 | Los Torrones 119 | Louco de Refri 120 | Lucas Angelo 121 | Lucca Venuci 122 | Ludmilla 123 | Malas On Line 124 | Malha Funk 125 | Malibu 126 | Marcinho e Cacau 127 | Max Pierre 128 | Max Rocha 129 | Maíra Brasil 130 | Mc 2k 131 | Mc Ale Soares 132 | Mc Alexandre 133 | Mc Andinho 134 | Mc Andrewzinho 135 | Mc Andrezinho do Complexo 136 | Mc Arthur o Sheik 137 | Mc B o 138 | Mc B.ó 139 | Mc Babi 140 | Mc Bahea 141 | Mc Barriga 142 | Mc Batata 143 | Mc Bella 144 | Mc Bellot 145 | Mc Belzinho 146 | Mc Biel NPF 147 | Mc Bielzinho 148 | Mc Biju 149 | Mc Bin Laden 150 | Mc Biro Leyby 151 | Mc Bobô 152 | Mc Bocarra 153 | Mc Bola 154 | Mc Bolado 155 | Mc Boy do Charmes 156 | Mc Brinquedo 157 | Mc Brisola 158 | Mc Britney 159 | Mc Bruninha 160 | Mc Bruninho 161 | Mc Bruno IP 162 | Mc Bruxo 163 | Mc Buiu 164 | Mc Byana 165 | Mc CL 166 | Mc Cabelinho 167 | Mc Cabide 168 | Mc Cacau 169 | Mc Careca 170 | Mc Carioca 171 | Mc Carol 172 | Mc Caçula 173 | Mc Cebezinho 174 | Mc Cezareth 175 | Mc Chapo 176 | Mc Charles da Alemoa 177 | Mc Chavero 178 | Mc Chicão 179 | Mc Chiquinho 180 | Mc Choko 181 | Mc Clebinho 182 | Mc Colibri 183 | Mc Copinho 184 | Mc Coringa 185 | Mc Coringa Louco 186 | Mc CpK 187 | Mc Crash 188 | Mc Cris 189 | Mc Cristiane 190 | Mc Cruel 191 | Mc Créu Funk 192 | Mc DG 193 | Mc Dada Boladão 194 | Mc Dadinho e Diguinho 195 | Mc Daleste 196 | Mc Danado 197 | Mc Dani 198 | Mc Danilo Boladão 199 | Mc Danilo Zika 200 | Mc Davi 201 | Mc David Bolado 202 | Mc Decão 203 | Mc Dedé 204 | Mc Delano 205 | Mc Delley FD 206 | Mc Dentinho 207 | Mc Denny 208 | Mc Dido 209 | Mc Didô 210 | Mc Dieguinho 211 | Mc Diguinho 212 | Mc Digão 213 | Mc Dimenor Dr 214 | Mc Dingo 215 | Mc Dino 216 | Mc Discolado 217 | Mc Dodo 013 218 | Mc Dodô 219 | Mc Don Juan 220 | Mc Doriva 221 | Mc Douglinhas 222 | Mc Dudu 223 | Mc Duduzinho 224 | Mc Eller 225 | Mc Etiopia 226 | Mc Fabin da VL 227 | Mc Fabuloso 228 | Mc Fael 229 | Mc Falcon 230 | Mc Falcão 231 | Mc Farmá 232 | Mc Federado e os Leleks 233 | Mc Felipe Boladão 234 | Mc Felype 235 | Mc Filhão 236 | Mc Fininho 237 | Mc Fioti 238 | Mc Frank 239 | Mc G15 240 | Mc G3 241 | Mc G7 242 | Mc GB 243 | Mc Gaah e Mc BP 244 | Mc Galo 245 | Mc Gibi 246 | Mc Gil do Andaraí 247 | Mc God 248 | Mc Godô 249 | Mc Gringo 250 | Mc Guga da VG 251 | Mc Gui 252 | Mc Guimê 253 | Mc Guto 254 | Mc Gw 255 | Mc Hariel 256 | Mc Hollywood 257 | Mc Hudson 22 258 | Mc Huguinho 259 | Mc IG 260 | Mc Illana 261 | Mc Islaibe 262 | Mc Italo 263 | Mc J15 264 | Mc JG 265 | Mc Jadson Boladão 266 | Mc Jair da Rocha 267 | Mc Japa 268 | Mc Japa e Mc Japinha 269 | Mc Jean Paul 270 | Mc Jefinho 271 | Mc Jennifer 272 | Mc Jenny 273 | Mc Jerry 274 | Mc Jhey 275 | Mc John Marquês 276 | Mc Johnzinho 277 | Mc Jotta Pê 278 | Mc João 279 | Mc Joãozinho VT 280 | Mc Juninho 281 | Mc 
Juninho Jr 282 | Mc Júnior e Leonardo 283 | Mc K9 284 | Mc Kaká 285 | Mc Kapela 286 | Mc Kapela MK 287 | Mc Karyne da Provi 288 | Mc Katia 289 | Mc Kauan 290 | Mc Keke 291 | Mc Kekel 292 | Mc Kelvin 293 | Mc Kelvinho 294 | Mc Kevin 295 | Mc Kevinho 296 | Mc Kitinho 297 | Mc Koringa 298 | Mc Ks 299 | Mc LB 300 | Mc LBX 301 | Mc Lan 302 | Mc Lano 303 | Mc Lany 304 | Mc Lary Figueiredo 305 | Mc Laís 306 | Mc Leke 307 | Mc Leléto 308 | Mc Leoest 309 | Mc Leozinho 310 | Mc Leozinho do Recife 311 | Mc Lexx 312 | Mc Lipi 313 | Mc Lipivox 314 | Mc Livinho 315 | Mc Loirinha 316 | Mc Loma E As Gêmeas Lacração 317 | Mc Lon 318 | Mc Lone 319 | Mc Luan 320 | Mc Luan 321 | Mc Luciano Sp 322 | Mc Lukaz 323 | Mc Lustosa 324 | Mc Léo da Baixada 325 | Mc Mac Air 326 | Mc Magal 327 | Mc Magrinho 328 | Mc Maha 329 | Mc Maiquinho 330 | Mc Mallone 331 | Mc Maneirinho 332 | Mc Marcelly 333 | Mc Marcinho 334 | Mc Marcio Braz 335 | Mc Marcio G 336 | Mc Marina 337 | Mc Marks 338 | Mc Marlin 339 | Mc Maromba 340 | Mc Martinho 341 | Mc Mascote 342 | Mc Matheuzinho DK 343 | Mc Max 344 | Mc Mazinho 345 | Mc Melqui 346 | Mc Menassi 347 | Mc Menor 348 | Mc Menor da VG 349 | Mc Menorzinha 350 | Mc Menorzão 351 | Mc Metal e Cego 352 | Mc Milk 353 | Mc Mingau 354 | Mc Mirella 355 | Mc Misa 356 | Mc Ml 357 | Mc Mm 358 | Mc Moreno 359 | Mc Mágico 360 | Mc Márcio G 361 | Mc Mãozinha 362 | Mc Naldinho 363 | Mc Nandinho 364 | Mc Nany 365 | Mc Natacha 366 | Mc Nayara 367 | Mc Nego Bam 368 | Mc Nego Blue 369 | Mc Neguinho 370 | Mc Negão do Arizona 371 | Mc Neguinho do Kaxeta 372 | Mc Nem 373 | Mc Neném 374 | Mc Nice 375 | Mc Nobruh 376 | Mc North 377 | Mc Novinha 378 | Mc Novinho 379 | Mc Ombrinho 380 | Mc Orelha 381 | Mc PH 382 | Mc PP Da VS 383 | Mc PR 384 | Mc Pack Original 385 | Mc Papo 386 | Mc Patoroko 387 | Mc Patrick MF 388 | Mc Pedrinho 389 | Mc Pedrinho Jr 390 | Mc Pedrinho e Mc Léo da Baixada 391 | Mc Pekeno 392 | Mc Pelé 393 | Mc Pereira 394 | Mc Pet 395 | Mc Petter 396 | Mc Phe Cachorrera 397 | Mc Pierre 398 | Mc Pikachu 399 | Mc PikenaSK 400 | Mc Pingo 401 | Mc Pirata 402 | Mc Pivete 403 | Mc Pocahontas 404 | Mc Poiaka 405 | Mc Primo 406 | Mc Princesa e Plebeu 407 | Mc Pé de Pano 408 | Mc R1 409 | Mc Rael 410 | Mc Rael Souza 411 | Mc Rafa 412 | Mc Renan 413 | Mc Ricardinho 414 | Mc Ricardo 415 | Mc Richesse 416 | Mc Rick Lima 417 | Mc Rihanna da Baixada 418 | Mc Rita 419 | Mc Roba Cena 420 | Mc Robertinho 421 | Mc Robinho de Prata 422 | Mc Robs 423 | Mc Rodolfinho 424 | Mc Rodrigão 425 | Mc Rojai 426 | Mc Romeu 427 | Mc Ronny Khalifa 428 | Mc Rose 429 | Mc Rubby 430 | Mc Ruzika 431 | Mc Sabrina 432 | Mc Sabãozinho 433 | Mc Saed 434 | Mc Samuka e Nego 435 | Mc Sapão 436 | Mc Sargento 437 | Mc Savinon 438 | Mc Serginho 439 | Mc Serginho da VS 440 | Mc Sluck 441 | Mc Smith 442 | Mc Suave 443 | Mc Suellen 444 | Mc Sunda 445 | Mc Suzy 446 | Mc TH 447 | Mc TL 448 | Mc Tarapi 449 | Mc Tartaruga 450 | Mc Tati Zaqui 451 | Mc Tavinho JBK 452 | Mc Taz 453 | Mc Tchesko 454 | Mc Teco e Buzunga 455 | Mc Tevez 456 | Mc Thesko 457 | Mc Tiki 458 | Mc Tikão 459 | Mc Timbu 460 | Mc Tiozinho 461 | Mc Tom 462 | Mc Troia 463 | Mc Tupan 464 | Mc Uchoa 465 | Mc Vareta 466 | Mc Vine 467 | Mc Viné 468 | Mc Vitinho 469 | Mc Vitinho 2 470 | Mc Vitão 471 | Mc Vitêra 472 | Mc Vuk Vuk 473 | Mc Wc 474 | Mc Wendy 475 | Mc William-SP 476 | Mc Wm 477 | Mc Xlep 478 | Mc Yago 479 | Mc Yuri BH 480 | Mc Yuri BH 481 | Mc Zaac 482 | Mc Zedek 483 | Mc Zoi de Gato 484 | Mc Zuka 485 | Mc aw 486 | Mc kill 487 | McViktor 488 | Mcs BW 489 | Mcs Deco e Luco 490 | 
Mcs Gêmeos 491 | Mcs Jhowzinho e Kadinho 492 | Mcs Magrelo e Nenê 493 | Mcs Nenem e Magrão 494 | Mcs Samuka e Nego 495 | Mcs Zaac e Jerry 496 | Medrado 497 | Meik Of 498 | Menor 499 | Menor da Provi 500 | Menor do Chapa 501 | Menor do Chapa 502 | Mensageiros da Favela 503 | Michel Grasiani 504 | Mike 505 | Milennium Cia Show 506 | Miryan Martin 507 | Mr Catra 508 | Mr Poll 509 | Mr Pézão 510 | Mr. Fia 511 | Mr. Jamaica 512 | Mr. Mu 513 | Mulher Filé 514 | Mulher Gato 515 | Mulher Melancia 516 | Mulher Melão 517 | Mulher Moranguinho 518 | Márcio E Goró 519 | Márcio do Cacuia 520 | Mó-H 521 | Naldo 522 | Nanda Black 523 | Nanda Lynn 524 | Nanda Lyra 525 | Nathy Souto 526 | Nayara 527 | Nego do Borel 528 | Neguinho Do Caxeta 529 | Neguinho do Caxeta 530 | Negão do Arizona 531 | Nélio e Espiga 532 | Olliver 533 | Os Atraentes 534 | Os Atrevidos 535 | Os Avassaladores 536 | Os Bad Boys Funk 537 | Os Carrascos 538 | Os Caçadores 539 | Os Cretinos 540 | Os Danadinhos 541 | Os Don Juan 542 | Os Hawaianos 543 | Os Magrinhos 544 | Os Nerds 545 | Os Novinhos 546 | Os Ousados 547 | Os Polêmicos 548 | Os Sem Noção 549 | Os Vadios 550 | Os poETs 551 | Oz Muleke’s 552 | Oz Predadorez 553 | PH Lima 554 | Pablo Renato 555 | Pancadão do Caldeirão do Huck 556 | Paranga 557 | Perlla 558 | Pikeno 559 | Pikeno e Menor 560 | Priscila Nocetti 561 | Prud Rey 562 | Rafael Vidalles 563 | Raimundo Soldado 564 | Raphaella 565 | Renatinho e Alemão 566 | Robinho da Prata 567 | Rocket Pocket 568 | Rodney Dy 569 | Sabrina Boing Boing 570 | San Danado 571 | Saulo Matos 572 | Sd Boys 573 | Sharon Axé Moi 574 | Silvio e Robson 575 | Sol do Recanto 576 | Stronda 4 life 577 | Suspeita 578 | Tathi Kiss 579 | Tati Quebra Barraco 580 | Tiago Miller 581 | Valentes do funk 582 | Valesca Popozuda 583 | Vanessinha Pikatchu 584 | Verônica Costa 585 | Vine Rodry 586 | Vinicius e Andinho 587 | Vinny e Will 588 | -------------------------------------------------------------------------------- /format_data.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from pathlib import Path 4 | 5 | from preprocessing.dataset import MusicDataset 6 | from preprocessing.tfrecord import SentenceTFRecord 7 | from preprocessing.text_preprocessing import (get_vocabulary, create_word_dictionaties, 8 | save, replace_unk_words, replace_words_with_ids, 9 | create_labels, get_sizes_list, create_chunks) 10 | 11 | 12 | DATA = 0 13 | LABELS = 1 14 | SIZES = 2 15 | 16 | 17 | def save_dataset(dataset_all, dataset_type, dataset_save_path): 18 | save_path = dataset_save_path / dataset_type 19 | 20 | if not save_path.is_dir(): 21 | save_path.mkdir() 22 | 23 | data = dataset_all[DATA] 24 | data_save_path = save_path / str(dataset_type + '_data.pkl') 25 | save(data, data_save_path) 26 | 27 | labels = dataset_all[LABELS] 28 | labels_save_path = save_path / str(dataset_type + '_labels.pkl') 29 | save(labels, labels_save_path) 30 | 31 | sizes = dataset_all[SIZES] 32 | sizes_save_path = save_path / str(dataset_type + '_sizes.pkl') 33 | save(sizes, sizes_save_path) 34 | 35 | 36 | def create_tfrecord(dataset, dataset_type, dataset_save_path): 37 | save_path = dataset_save_path / dataset_type / str(dataset_type + '.tfrecord') 38 | sentence_tfrecord = SentenceTFRecord(dataset, str(save_path)) 39 | sentence_tfrecord.parse_sentences() 40 | 41 | 42 | def full_preprocessing(train, validation, test, data_folder, 43 | dataset_save_path, min_frequency): 44 | 45 | data_folder = Path(data_folder) 46 | 47 | print('Creating 
vocabulary ...') 48 | vocabulary = get_vocabulary(train, min_frequency) 49 | print('Vocabulary lenght: {}'.format(len(vocabulary))) 50 | word2index, index2word = create_word_dictionaties(vocabulary) 51 | 52 | index2word_path = data_folder / dataset_save_path / 'index2word.pkl' 53 | save(index2word, index2word_path) 54 | word2index_path = data_folder / dataset_save_path / 'word2index.pkl' 55 | save(word2index, word2index_path) 56 | 57 | print('Replacing unknown words ...') 58 | replace_unk_words(train, word2index) 59 | replace_unk_words(validation, word2index) 60 | replace_unk_words(test, word2index) 61 | 62 | print('Turning words into word ids ...') 63 | train = replace_words_with_ids(train, word2index) 64 | validation = replace_words_with_ids(validation, word2index) 65 | test = replace_words_with_ids(test, word2index) 66 | 67 | print('Creating chunks ...') 68 | train = create_chunks(train, chunk_max_size=35) 69 | validation = create_chunks(validation, chunk_max_size=35) 70 | test = create_chunks(test, chunk_max_size=35) 71 | 72 | print('Creating labels ...') 73 | train, train_labels = create_labels(train) 74 | validation, validation_labels = create_labels(validation) 75 | test, test_labels = create_labels(test) 76 | 77 | print('Creating size list ...') 78 | train_sizes = get_sizes_list(train) 79 | validation_sizes = get_sizes_list(validation) 80 | test_sizes = get_sizes_list(test) 81 | 82 | train_all = (train, train_labels, train_sizes) 83 | validation_all = (validation, validation_labels, validation_sizes) 84 | test_all = (test, test_labels, test_sizes) 85 | 86 | dataset_save_path = data_folder / dataset_save_path 87 | 88 | print('Saving datasets ...') 89 | save_dataset(train_all, 'train', dataset_save_path) 90 | save_dataset(validation_all, 'validation', dataset_save_path) 91 | save_dataset(test_all, 'test', dataset_save_path) 92 | 93 | print('Creating tfrecords ...') 94 | create_tfrecord(train_all, 'train', dataset_save_path) 95 | create_tfrecord(validation_all, 'validation', dataset_save_path) 96 | create_tfrecord(test_all, 'test', dataset_save_path) 97 | 98 | 99 | def create_argparse(): 100 | parser = argparse.ArgumentParser() 101 | 102 | parser.add_argument('-df', 103 | '--data-folder', 104 | type=str, 105 | help='Location of the songs files') 106 | 107 | parser.add_argument('-dsp', 108 | '--dataset-save-path', 109 | type=str, 110 | help='Location to save the dataset files') 111 | 112 | parser.add_argument('-mf', 113 | '--min-frequency', 114 | type=int, 115 | help='Minimum word frequency required for a word to a part of the vocabulary') # noqa 116 | 117 | parser.add_argument('-vp', 118 | '--validation-percent', 119 | type=float, 120 | help='Percent of train dataset to use for validation') 121 | 122 | parser.add_argument('-tp', 123 | '--test-percent', 124 | type=float, 125 | help='Percent of train dataset to use for test') 126 | 127 | return parser 128 | 129 | 130 | def main(): 131 | parser = create_argparse() 132 | user_args = vars(parser.parse_args()) 133 | 134 | data_folder = user_args['data_folder'] 135 | dataset_save_path = Path(user_args['dataset_save_path']) 136 | music_dataset = MusicDataset(data_folder, dataset_save_path) 137 | 138 | validation_percent = user_args['validation_percent'] 139 | test_percent = user_args['test_percent'] 140 | music_dataset.create_dataset( 141 | validation_percent=validation_percent, test_percent=test_percent) 142 | music_dataset.display_info() 143 | 144 | train_dataset = music_dataset.train_dataset 145 | validation_dataset = 
music_dataset.validation_dataset 146 | test_dataset = music_dataset.test_dataset 147 | min_frequency = user_args['min_frequency'] 148 | 149 | full_preprocessing( 150 | train=train_dataset, 151 | validation=validation_dataset, 152 | test=test_dataset, 153 | data_folder=data_folder, 154 | dataset_save_path=dataset_save_path, 155 | min_frequency=min_frequency) 156 | 157 | 158 | if __name__ == '__main__': 159 | main() 160 | -------------------------------------------------------------------------------- /masks/romano.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/masks/romano.png -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | from model.input_pipeline import InputPipeline 5 | from model.rnn import RecurrentModel, RecurrentConfig 6 | from model.song_generator import GreedySongGenerator 7 | from utils.session_manager import initialize_session 8 | 9 | 10 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 11 | 12 | 13 | def create_argparse(): 14 | argument_parser = argparse.ArgumentParser() 15 | 16 | argument_parser.add_argument('-tf', 17 | '--train-file', 18 | type=str, 19 | help='Location of the training file') 20 | 21 | argument_parser.add_argument('-vf', 22 | '--validation-file', 23 | type=str, 24 | help='Location of the validation file') 25 | 26 | argument_parser.add_argument('-tef', 27 | '--test-file', 28 | type=str, 29 | help='Location of the test file') 30 | 31 | argument_parser.add_argument('-chp', 32 | '--checkpoint-path', 33 | type=str, 34 | help="The path to save model's checkpoint") 35 | 36 | argument_parser.add_argument('-uch', 37 | '--use-checkpoint', 38 | type=int, 39 | help='If the model checkpoint should be loaded') 40 | 41 | argument_parser.add_argument('-i2w', 42 | '--index2word-path', 43 | type=str, 44 | help='Location of the index2word dict') 45 | 46 | argument_parser.add_argument('-w2i', 47 | '--word2index-path', 48 | type=str, 49 | help='Location of word2index dict') 50 | 51 | argument_parser.add_argument('-ne', 52 | '--num-epochs', 53 | type=int, 54 | help='Number of epochs to train') 55 | 56 | argument_parser.add_argument('-bs', 57 | '--batch-size', 58 | type=int, 59 | help='Batch size to use in the model') 60 | 61 | argument_parser.add_argument('-lr', 62 | '--learning-rate', 63 | type=float, 64 | help='Learning rate to use when training') 65 | 66 | argument_parser.add_argument('-nl', 67 | '--num-layers', 68 | type=int, 69 | help='Number of lstm layers to use') 70 | 71 | argument_parser.add_argument('-nu', 72 | '--num-units', 73 | type=int, 74 | help='Number of units to use in the lstm cell') 75 | 76 | argument_parser.add_argument('-vs', 77 | '--vocab-size', 78 | type=int, 79 | help='Size of the vocabulary') 80 | 81 | argument_parser.add_argument('-es', 82 | '--embedding-size', 83 | type=int, 84 | help='Dimension of the embedding matrix') 85 | 86 | argument_parser.add_argument('-ed', 87 | '--embedding-dropout', 88 | type=float, 89 | help='Embedding dropout') 90 | 91 | argument_parser.add_argument('-lod', 92 | '--lstm-output-dropout', 93 | type=float, 94 | help='LSTM output dropout') 95 | 96 | argument_parser.add_argument('-lid', 97 | '--lstm-input-dropout', 98 | type=float, 99 | help='LSTM input dropout') 100 | 101 | argument_parser.add_argument('-lsd', 102 | 
'--lstm-state-dropout', 103 | type=float, 104 | help='LSTM state dropout') 105 | 106 | argument_parser.add_argument('-wd', 107 | '--weight-decay', 108 | type=float, 109 | help='Weight decay') 110 | 111 | argument_parser.add_argument('-minv', 112 | '--min-val', 113 | type=int, 114 | help='Min value to use when initializing weights') 115 | 116 | argument_parser.add_argument('-maxv', 117 | '--max-val', 118 | type=int, 119 | help='Max value to use when initializing weights') 120 | 121 | argument_parser.add_argument('-nbc', 122 | '--num-buckets', 123 | type=int, 124 | help='Number of buckets to use') 125 | 126 | argument_parser.add_argument('-bcw', 127 | '--bucket-width', 128 | type=int, 129 | help='Number of elements allowed in bucket') 130 | 131 | argument_parser.add_argument('-pb', 132 | '--prefetch-buffer', 133 | type=int, 134 | help='Size of prefetch buffer') 135 | 136 | argument_parser.add_argument('-pf', 137 | '--perform-shuffle', 138 | type=int, 139 | help='If we shoudl shuffle the batches when training the model') 140 | 141 | return argument_parser 142 | 143 | 144 | def main(): 145 | argument_parser = create_argparse() 146 | user_args = vars(argument_parser.parse_args()) 147 | 148 | train_file = user_args['train_file'] 149 | validation_file = user_args['validation_file'] 150 | test_file = user_args['test_file'] 151 | batch_size = user_args['batch_size'] 152 | num_buckets = user_args['num_buckets'] 153 | bucket_width = user_args['bucket_width'] 154 | prefetch_buffer = user_args['prefetch_buffer'] 155 | perform_shuffle = True if user_args['perform_shuffle'] == 1 else False 156 | 157 | dataset = InputPipeline( 158 | train_files=train_file, 159 | validation_files=validation_file, 160 | test_files=test_file, 161 | batch_size=batch_size, 162 | perform_shuffle=perform_shuffle, 163 | bucket_width=bucket_width, 164 | num_buckets=num_buckets, 165 | prefetch_buffer=prefetch_buffer) 166 | 167 | dataset.build_pipeline() 168 | 169 | config = RecurrentConfig(user_args) 170 | model = RecurrentModel(dataset, config) 171 | model.build_graph() 172 | 173 | with initialize_session(config) as (sess, saver): 174 | model.fit(sess, saver) 175 | 176 | generator = GreedySongGenerator(model) 177 | print('Generating song (Greedy) ...') 178 | generator.generate(sess) 179 | 180 | 181 | if __name__ == '__main__': 182 | main() 183 | -------------------------------------------------------------------------------- /model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/model/__init__.py -------------------------------------------------------------------------------- /model/input_pipeline.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | class SongDataset: 5 | 6 | def __init__(self, data, batch_size, perform_shuffle, 7 | bucket_width, num_buckets, prefetch_buffer): 8 | self.data = data 9 | self.batch_size = batch_size 10 | self.perform_shuffle = perform_shuffle 11 | self.bucket_width = bucket_width 12 | self.num_buckets = num_buckets 13 | self.prefetch_buffer = prefetch_buffer 14 | 15 | def parser(self, tfrecord): 16 | context_features = { 17 | 'size': tf.FixedLenFeature([], dtype=tf.int64), 18 | } 19 | 20 | sequence_features = { 21 | 'tokens': tf.FixedLenSequenceFeature([], dtype=tf.int64), 22 | 'labels': tf.FixedLenSequenceFeature([], dtype=tf.int64) 23 | } 24 | 25 | tfrecord_parsed = 
tf.parse_single_sequence_example( 26 | tfrecord, context_features, sequence_features) 27 | 28 | tokens = tfrecord_parsed[1]['tokens'] 29 | labels = tfrecord_parsed[1]['labels'] 30 | size = tfrecord_parsed[0]['size'] 31 | 32 | return tokens, labels, size 33 | 34 | def init_dataset(self): 35 | song_dataset = tf.data.TFRecordDataset(self.data) 36 | song_dataset = song_dataset.cache() 37 | song_dataset = song_dataset.map(self.parser, num_parallel_calls=8) 38 | 39 | return song_dataset 40 | 41 | def create_dataset(self): 42 | song_dataset = self.init_dataset() 43 | 44 | if self.perform_shuffle: 45 | song_dataset = song_dataset.shuffle(buffer_size=self.batch_size * 2) 46 | 47 | def batching_func(dataset): 48 | return dataset.padded_batch( 49 | self.batch_size, 50 | padded_shapes=( 51 | tf.TensorShape([None]), # token 52 | tf.TensorShape([None]), # label 53 | tf.TensorShape([])) # size 54 | ) 55 | 56 | def key_func(tokens, labels, size): 57 | bucket_id = size // self.bucket_width 58 | 59 | return tf.to_int64(tf.minimum(bucket_id, self.num_buckets)) 60 | 61 | def reduce_func(bucket_key, widowed_data): 62 | return batching_func(widowed_data) 63 | 64 | song_dataset = song_dataset.apply( 65 | tf.contrib.data.group_by_window( 66 | key_func=key_func, reduce_func=reduce_func, window_size=self.batch_size)) 67 | 68 | self.song_dataset = song_dataset.prefetch(self.prefetch_buffer) 69 | 70 | return self.song_dataset 71 | 72 | 73 | class InputPipeline: 74 | 75 | def __init__(self, train_files, validation_files, test_files, batch_size, perform_shuffle, 76 | bucket_width, num_buckets, prefetch_buffer): 77 | self.train_files = train_files 78 | self.validation_files = validation_files 79 | self.test_files = test_files 80 | self.batch_size = batch_size 81 | self.perform_shuffle = perform_shuffle 82 | self.bucket_width = bucket_width 83 | self.num_buckets = num_buckets 84 | self.prefetch_buffer = prefetch_buffer 85 | 86 | self._train_iterator_op = None 87 | self._validation_iterator_op = None 88 | self._test_iterator_op = None 89 | 90 | @property 91 | def train_iterator(self): 92 | return self._train_iterator_op 93 | 94 | @property 95 | def validation_iterator(self): 96 | return self._validation_iterator_op 97 | 98 | @property 99 | def test_iterator(self): 100 | return self._test_iterator_op 101 | 102 | def create_datasets(self, dataset=SongDataset): 103 | train_dataset = dataset( 104 | self.train_files, self.batch_size, self.perform_shuffle, 105 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 106 | validation_dataset = dataset( 107 | self.validation_files, self.batch_size, False, 108 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 109 | test_dataset = dataset( 110 | self.test_files, self.batch_size, False, 111 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 112 | 113 | self.train_dataset = train_dataset.create_dataset() 114 | self.validation_dataset = validation_dataset.create_dataset() 115 | self.test_dataset = test_dataset.create_dataset() 116 | 117 | def create_iterator(self): 118 | self._train_iterator_op = self.train_dataset.make_initializable_iterator() 119 | self._validation_iterator_op = self.validation_dataset.make_initializable_iterator() 120 | self._test_iterator_op = self.test_dataset.make_initializable_iterator() 121 | 122 | def get_num_batches(self, iterator): 123 | with tf.Session() as sess: 124 | num_batches = 0 125 | sess.run(iterator.initializer) 126 | 127 | while True: 128 | try: 129 | _, _, _ = sess.run(iterator.get_next()) 130 | num_batches += 1 131 | 
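# Note: tf.data iterators raise tf.errors.OutOfRangeError when the dataset is exhausted; it is caught below to stop counting batches.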
except tf.errors.OutOfRangeError: 132 | break 133 | 134 | return num_batches 135 | 136 | def get_datasets_num_batches(self): 137 | self.train_batches = self.get_num_batches(self.train_iterator) 138 | self.validation_batches = self.get_num_batches(self.validation_iterator) 139 | self.test_batches = self.get_num_batches(self.test_iterator) 140 | 141 | def build_pipeline(self): 142 | self.create_datasets() 143 | self.create_iterator() 144 | -------------------------------------------------------------------------------- /model/rnn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | from model.song_model import ModelConfig, SongLyricsModel 5 | 6 | 7 | class RecurrentConfig(ModelConfig): 8 | 9 | def __init__(self, model_params): 10 | super().__init__(model_params) 11 | 12 | self.num_layers = model_params['num_layers'] 13 | self.num_units = model_params['num_units'] 14 | self.embedding_dropout = model_params['embedding_dropout'] 15 | self.lstm_state_dropout = model_params['lstm_state_dropout'] 16 | self.lstm_input_dropout = model_params['lstm_input_dropout'] 17 | self.lstm_output_dropout = model_params['lstm_output_dropout'] 18 | self.weight_decay = model_params['weight_decay'] 19 | self.min_val = model_params['min_val'] 20 | self.max_val = model_params['max_val'] 21 | 22 | 23 | class RecurrentModel(SongLyricsModel): 24 | 25 | def add_placeholders_op(self): 26 | self.embedding_dropout_placeholder = tf.placeholder( 27 | tf.float32, name='embedding_dropout') 28 | self.lstm_state_dropout_placeholder = tf.placeholder( 29 | tf.float32, name='lstm_state_dropout') 30 | self.lstm_input_dropout_placeholder = tf.placeholder( 31 | tf.float32, name='lstm_input_dropout') 32 | self.lstm_output_dropout_placeholder = tf.placeholder( 33 | tf.float32, name='lstm_output_dropout') 34 | 35 | def create_train_feed_dict(self): 36 | feed_dict = { 37 | self.embedding_dropout_placeholder: self.config.embedding_dropout, 38 | self.lstm_state_dropout_placeholder: self.config.lstm_state_dropout, 39 | self.lstm_input_dropout_placeholder: self.config.lstm_input_dropout, 40 | self.lstm_output_dropout_placeholder: self.config.lstm_output_dropout, 41 | } 42 | 43 | return feed_dict 44 | 45 | def create_validation_feed_dict(self): 46 | feed_dict = { 47 | self.embedding_dropout_placeholder: 1.0, 48 | self.lstm_state_dropout_placeholder: 1.0, 49 | self.lstm_input_dropout_placeholder: 1.0, 50 | self.lstm_output_dropout_placeholder: 1.0, 51 | } 52 | 53 | return feed_dict 54 | 55 | def create_generate_feed_dict(self, data, temperature, state): 56 | feed_dict = self.create_validation_feed_dict() 57 | 58 | feed_dict[self.data_placeholder] = data 59 | feed_dict[self.temperature_placeholder] = temperature 60 | feed_dict[self.initial_state] = state 61 | 62 | return feed_dict 63 | 64 | def add_embeddings_op(self, data_batch): 65 | with tf.name_scope('embeddings'): 66 | self.embeddings = tf.get_variable( 67 | 'embeddings', 68 | initializer=tf.random_uniform_initializer( 69 | minval=self.config.min_val, maxval=self.config.max_val), 70 | shape=(self.config.vocab_size, self.config.embedding_size), 71 | dtype=tf.float32 72 | ) 73 | 74 | self.embeddings_dropout = tf.nn.dropout( 75 | self.embeddings, keep_prob=self.embedding_dropout_placeholder) 76 | 77 | inputs = tf.nn.embedding_lookup( 78 | self.embeddings_dropout, data_batch) 79 | 80 | return inputs 81 | 82 | def add_logits_op(self, data_batch, size_batch, reuse=False): 83 | with tf.variable_scope('logits', reuse=reuse): 84 
| data_embeddings = self.add_embeddings_op(data_batch) 85 | 86 | with tf.name_scope('recurrent_layer'): 87 | def make_cell(input_size): 88 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 89 | self.config.num_units) 90 | drop_cell = tf.nn.rnn_cell.DropoutWrapper( 91 | lstm_cell, 92 | state_keep_prob=self.lstm_state_dropout_placeholder, 93 | output_keep_prob=self.lstm_output_dropout_placeholder, 94 | variational_recurrent=True, 95 | input_size=input_size, 96 | dtype=tf.float32) 97 | 98 | return drop_cell 99 | 100 | input_sizes = [ 101 | self.config.embedding_size, self.config.num_units, self.config.num_units 102 | ] 103 | self.cell = tf.nn.rnn_cell.MultiRNNCell( 104 | [make_cell(input_sizes[i]) for i in range(self.config.num_layers)]) 105 | 106 | self.initial_state = self.cell.zero_state( 107 | tf.shape(data_batch)[0], tf.float32) 108 | 109 | outputs, final_state = tf.nn.dynamic_rnn( 110 | self.cell, 111 | data_embeddings, 112 | sequence_length=size_batch, 113 | initial_state=self.initial_state, 114 | dtype=tf.float32 115 | ) 116 | 117 | with tf.name_scope('logits'): 118 | flat_outputs = tf.reshape(outputs, [-1, self.config.num_units]) 119 | 120 | weights = tf.get_variable( 121 | 'weights', 122 | initializer=tf.contrib.layers.xavier_initializer(), 123 | shape=(self.config.num_units, self.config.embedding_size), 124 | dtype=tf.float32) 125 | 126 | bias = tf.get_variable( 127 | 'bias', 128 | initializer=tf.ones_initializer(), 129 | shape=(self.config.embedding_size), 130 | dtype=tf.float32) 131 | 132 | flat_inputs = tf.matmul( 133 | flat_outputs, weights) + bias 134 | 135 | bias_logits = tf.get_variable( 136 | 'bias_logits', 137 | initializer=tf.ones_initializer(), 138 | shape=(self.config.vocab_size), 139 | dtype=tf.float32) 140 | 141 | flat_logits = tf.matmul( 142 | flat_inputs, tf.transpose(self.embeddings)) + bias_logits 143 | 144 | batch_size = tf.shape(data_batch)[0] 145 | max_len = tf.shape(data_batch)[1] 146 | 147 | logits = tf.reshape( 148 | flat_logits, [batch_size, max_len, self.config.vocab_size]) 149 | 150 | return logits, final_state 151 | 152 | def add_loss_op(self, logits, labels_batch, size_batch): 153 | with tf.name_scope('loss'): 154 | weights = tf.sequence_mask(size_batch, dtype=tf.float32) 155 | 156 | seq_loss = tf.contrib.seq2seq.sequence_loss( 157 | logits=logits, 158 | targets=labels_batch, 159 | weights=weights 160 | ) 161 | 162 | loss = tf.reduce_sum(seq_loss) 163 | 164 | return loss 165 | 166 | def add_l2_regularizer_op(self, loss): 167 | l2_loss = self.config.weight_decay * tf.add_n( 168 | [tf.nn.l2_loss(v) for v in tf.trainable_variables()]) 169 | 170 | return loss + l2_loss 171 | 172 | def add_train_op(self, loss): 173 | optimizer = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate) 174 | optimizer_op = optimizer.minimize(loss) 175 | 176 | return optimizer_op 177 | -------------------------------------------------------------------------------- /model/sample_generator.py: -------------------------------------------------------------------------------- 1 | from model.rnn import RecurrentModel, RecurrentConfig 2 | from model.song_generator import GreedySongGenerator 3 | 4 | import tensorflow as tf 5 | 6 | 7 | def create_sample(model_config, html=False): 8 | sample_config = RecurrentConfig(model_config) 9 | graph = tf.Graph() 10 | 11 | with graph.as_default(): 12 | model = RecurrentModel(None, sample_config) 13 | model.build_placeholders() 14 | model.build_generate_graph(reuse=False) 15 | 16 | config = tf.ConfigProto(device_count={'GPU': 0}) 17 | sess = 
tf.Session(config=config) 18 | 19 | checkpoint = tf.train.latest_checkpoint(sample_config.checkpoint_path) 20 | saver = tf.train.Saver() 21 | saver.restore(sess, checkpoint) 22 | generator = GreedySongGenerator(model) 23 | 24 | def sample(prime_words, html): 25 | prime_words = prime_words.split() 26 | with graph.as_default(): 27 | return generator.generate(sess, prime_words=prime_words, html=html) 28 | 29 | return sample 30 | -------------------------------------------------------------------------------- /model/song_generator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | class GreedySongGenerator: 6 | 7 | def __init__(self, model): 8 | self.model = model 9 | 10 | def parse_song(self, song_list, html): 11 | parsed_song = [] 12 | is_mc = False 13 | 14 | for index in range(len(song_list)): 15 | curr_word = song_list[index] 16 | 17 | if index == 0: 18 | parsed_song.append(curr_word) 19 | continue 20 | 21 | if curr_word[0].isupper(): 22 | if not html: 23 | parsed_song.append('\n') 24 | else: 25 | parsed_song.append('
') 26 | 27 | if is_mc: 28 | parsed_song.append('Neural') 29 | is_mc = False 30 | else: 31 | parsed_song.append(curr_word) 32 | 33 | if curr_word.lower() == 'mc': 34 | is_mc = True 35 | 36 | return ' '.join(parsed_song) 37 | 38 | def weighted_pick(self, weights): 39 | t = np.cumsum(weights) 40 | s = np.sum(weights) 41 | return(int(np.searchsorted(t, np.random.rand(1)*s))) 42 | 43 | def create_initial_state(self, sess): 44 | state = sess.run(self.model.cell.zero_state(1, tf.float32)) 45 | word = self.model.word2index[''] 46 | 47 | return state, word 48 | 49 | def create_prime_state(self, sess, prime_words, temperature): 50 | state = sess.run(self.model.cell.zero_state(1, tf.float32)) 51 | 52 | probs = None 53 | for word in prime_words: 54 | id_word = self.model.word2index.get(word, -1) 55 | 56 | if id_word == -1: 57 | continue 58 | 59 | probs, state = self.model.predict(sess, state, id_word, temperature) 60 | 61 | if probs is not None: 62 | while True: 63 | generated_word_id = self.weighted_pick(probs) 64 | 65 | if generated_word_id != 1: 66 | break 67 | else: 68 | return self.create_initial_state(sess) 69 | 70 | return state, generated_word_id 71 | 72 | def generate(self, sess, prime_words=None, temperature=0.7, num_out=200, html=False): 73 | song = [] 74 | current_word = "" 75 | repetition_counter = 0 76 | unk_count = 0 77 | 78 | sequences = [] 79 | sequence = "" 80 | restart = False 81 | 82 | if prime_words: 83 | state, word = self.create_prime_state(sess, prime_words, temperature) 84 | song.extend(prime_words) 85 | else: 86 | state, word = self.create_initial_state(sess) 87 | 88 | for i in range(num_out): 89 | probs, state = self.model.predict(sess, state, word, temperature) 90 | probs = probs[0].reshape(-1) 91 | 92 | while True: 93 | generated_word_id = self.weighted_pick(probs) 94 | generated_word = str(self.model.index2word.get(generated_word_id, 1)) 95 | 96 | if generated_word == '': 97 | unk_count += 0 98 | 99 | if unk_count >= 150: 100 | return -1 101 | 102 | continue 103 | 104 | if generated_word == '' and len(song) < 100: 105 | continue 106 | 107 | if generated_word.lower() != current_word.lower(): 108 | current_word = generated_word 109 | repetition_counter = 0 110 | elif current_word != '': 111 | repetition_counter += 1 112 | 113 | if repetition_counter >= 5: 114 | if repetition_counter >= 100: 115 | return -1 116 | continue 117 | 118 | if generated_word != '': 119 | unk_count = 0 120 | break 121 | 122 | word = generated_word_id 123 | 124 | if generated_word[0].isupper(): 125 | 126 | if sequences.count(sequence) >= 3: 127 | state, word = self.create_initial_state(sess) 128 | restart = True 129 | 130 | sequences.append(sequence) 131 | sequence = generated_word 132 | 133 | if restart: 134 | restart = False 135 | continue 136 | 137 | else: 138 | sequence += generated_word 139 | 140 | if generated_word == '': 141 | break 142 | 143 | song.append(str(generated_word)) 144 | 145 | return self.parse_song(song, html) 146 | -------------------------------------------------------------------------------- /model/song_model.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import random 3 | import os 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | 9 | class ModelConfig: 10 | 11 | def __init__(self, model_params): 12 | self.vocab_size = model_params['vocab_size'] 13 | self.embedding_size = model_params['embedding_size'] 14 | self.learning_rate = model_params['learning_rate'] 15 | self.num_epochs = 
model_params['num_epochs'] 16 | self.use_checkpoint = model_params['use_checkpoint'] 17 | self.index2word_path = model_params['index2word_path'] 18 | self.word2index_path = model_params['word2index_path'] 19 | self.checkpoint_path = model_params['checkpoint_path'] 20 | 21 | 22 | class SongLyricsModel: 23 | 24 | def __init__(self, dataset, config): 25 | self.dataset = dataset 26 | self.config = config 27 | 28 | self.index2word = self.load_dict(self.config.index2word_path) 29 | self.word2index = self.load_dict(self.config.word2index_path) 30 | 31 | def load_dict(self, path): 32 | with open(path, 'rb') as f: 33 | return pickle.load(f) 34 | 35 | def add_embedding_op(self, data_batch): 36 | raise NotImplementedError 37 | 38 | def add_logits_op(self, data_batch, size_batch, reuse=False): 39 | raise NotImplementedError 40 | 41 | def add_loss_op(self, logits, labels_batch, size_batch): 42 | raise NotImplementedError 43 | 44 | def add_train_op(self, loss): 45 | raise NotImplementedError 46 | 47 | def run_epoch(self, sess, ops, feed_dict, training=True): 48 | costs = 0 49 | num_iters = 0 50 | state = None 51 | 52 | while True: 53 | try: 54 | 55 | if random.random() >= 0.01 and state is not None: 56 | feed_dict[self.initial_state] = state 57 | 58 | _, batch_loss, state = sess.run(ops, feed_dict=feed_dict) 59 | 60 | costs += batch_loss 61 | num_iters += 1 62 | 63 | except tf.errors.OutOfRangeError: 64 | return np.exp(costs / num_iters) 65 | 66 | def fit(self, sess, saver): 67 | best_perplexity = 10000000 68 | train_feed_dict = self.create_train_feed_dict() 69 | 70 | for i in range(self.config.num_epochs): 71 | print('Running epoch: {}'.format(i + 1)) 72 | ops = [self.train_op, self.train_loss, self.train_state] 73 | sess.run(self.train_iterator.initializer) 74 | train_perplexity = self.run_epoch(sess, ops, train_feed_dict) 75 | print('Train perplexity: {:.3f}'.format(train_perplexity)) 76 | 77 | if train_perplexity < best_perplexity: 78 | best_perplexity = train_perplexity 79 | print('New best perplexity found ! 
{:.3f}'.format(best_perplexity)) 80 | 81 | saver.save( 82 | sess, os.path.join(self.config.checkpoint_path, 'song_model.ckpt')) 83 | 84 | def predict(self, sess, state, word, temperature): 85 | input_word_id = np.array([[word]]) 86 | feed_dict = self.create_generate_feed_dict( 87 | data=input_word_id, 88 | temperature=temperature, 89 | state=state) 90 | 91 | probs, state = sess.run( 92 | [self.generate_predictions, self.generate_state], feed_dict=feed_dict) 93 | 94 | return probs, state 95 | 96 | def build_placeholders(self): 97 | with tf.name_scope('placeholders'): 98 | self.add_placeholders_op() 99 | 100 | self.data_placeholder = tf.placeholder(tf.int32, [None, 1]) 101 | self.temperature_placeholder = tf.placeholder(tf.float32) 102 | 103 | def build_generate_graph(self, reuse=True): 104 | with tf.name_scope('generate'): 105 | 106 | generate_size = np.array([1]) 107 | generate_logits, self.generate_state = self.add_logits_op( 108 | self.data_placeholder, generate_size, reuse=reuse) 109 | 110 | temperature_logits = tf.div(generate_logits, self.temperature_placeholder) 111 | self.generate_predictions = tf.nn.softmax(temperature_logits) 112 | 113 | def build_graph(self): 114 | self.build_placeholders() 115 | 116 | with tf.name_scope('iterator'): 117 | self.train_iterator = self.dataset.train_iterator 118 | self.validation_iterator = self.dataset.validation_iterator 119 | self.test_iterator = self.dataset.test_iterator 120 | 121 | with tf.name_scope('train_data'): 122 | train_data, train_labels, train_size = self.train_iterator.get_next() 123 | 124 | with tf.name_scope('validation_data'): 125 | (validation_data, validation_labels, 126 | validation_size) = self.validation_iterator.get_next() 127 | 128 | with tf.name_scope('test_data'): 129 | test_data, test_labels, test_size = self.test_iterator.get_next() 130 | 131 | with tf.name_scope('train'): 132 | train_logits, self.train_state = self.add_logits_op(train_data, train_size) 133 | self.train_loss = self.add_loss_op(train_logits, train_labels, train_size) 134 | train_l2_loss = self.add_l2_regularizer_op(self.train_loss) 135 | self.train_op = self.add_train_op(train_l2_loss) 136 | 137 | with tf.name_scope('validation'): 138 | validation_logits, self.validation_state = self.add_logits_op( 139 | validation_data, validation_size, reuse=True) 140 | self.validation_loss = self.add_loss_op( 141 | validation_logits, validation_labels, validation_size) 142 | 143 | self.build_generate_graph() 144 | -------------------------------------------------------------------------------- /preprocessing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/preprocessing/__init__.py -------------------------------------------------------------------------------- /preprocessing/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import pickle 4 | import re 5 | 6 | from pathlib import Path 7 | 8 | 9 | class MusicDataset: 10 | 11 | def __init__(self, data_folder, dataset_save_path): 12 | self.data_folder = Path(data_folder) 13 | self.dataset_save_path = self.data_folder / dataset_save_path 14 | 15 | def parse_song(self, song_text): 16 | parsed_song = [] 17 | 18 | for lines in song_text.split('\n'): 19 | words = lines.split() 20 | 21 | if not words: 22 | continue 23 | 24 | parsed_words = [words[0]] 25 | 26 | for word in words[1:]: 27 | 
parsed_words.append(word.lower()) 28 | 29 | parsed_song.append(' '.join(parsed_words)) 30 | 31 | return '\n'.join(parsed_song) 32 | 33 | def read_song(self, song_path): 34 | with song_path.open() as song_file: 35 | return self.parse_song(song_file.read()) 36 | 37 | def get_num_words_from_song(self, song): 38 | return len(song.replace('\n', ' ').split(' ')) 39 | 40 | def format_song_text(self, song_text): 41 | song_text = re.sub(r",", "", song_text) 42 | song_text = re.sub(r"!", " ! ", song_text) 43 | song_text = re.sub(r"\?", " ? ", song_text) 44 | song_text = re.sub(r"\)", "", song_text) 45 | song_text = re.sub(r"\(", "", song_text) 46 | song_text = re.sub(r"\}", "", song_text) 47 | song_text = re.sub(r"\{", "", song_text) 48 | song_text = re.sub(r":", "", song_text) 49 | song_text = re.sub(r"\.", " ", song_text) 50 | song_text = re.sub(r"\n", " \n ", song_text) 51 | song_text = re.sub(r'"', " ", song_text) 52 | song_text = re.sub(r"\[.*\]", " ", song_text) 53 | 54 | song_text = ' ' + song_text + ' ' 55 | song_text = re.sub(r'\s{2,}', ' ', song_text) 56 | 57 | song_text = song_text.split(' ') 58 | 59 | return song_text 60 | 61 | def get_songs(self): 62 | self.all_songs = [] 63 | 64 | for artist in os.listdir(str(self.data_folder)): 65 | artist_path = self.data_folder / artist 66 | 67 | if not artist_path.is_dir(): 68 | continue 69 | 70 | for song in os.listdir(str(artist_path)): 71 | if song == 'song_codes.txt' or song == 'song_names.txt': 72 | continue 73 | 74 | song_path = artist_path / song 75 | song_text = self.read_song(song_path) 76 | 77 | song_text = self.format_song_text(song_text) 78 | self.all_songs.append(song_text) 79 | 80 | def average_song_size(self): 81 | avg_size = 0 82 | for song in self.all_songs: 83 | avg_size += len(song) 84 | 85 | return avg_size / len(self.all_songs) 86 | 87 | def load_dataset(self, dataset_type): 88 | dataset_path = self.dataset_save_path / dataset_type 89 | 90 | with dataset_path.open(mode='rb') as dataset_file: 91 | return pickle.load(dataset_file) 92 | 93 | def load_datasets(self): 94 | if not self.dataset_save_path.is_dir(): 95 | return False 96 | 97 | print('Loading datasets ...') 98 | self.all_songs = self.load_dataset('all_songs.pkl') 99 | self.train_dataset = self.load_dataset('train/raw_train.pkl') 100 | self.validation_dataset = self.load_dataset('validation/raw_validation.pkl') 101 | self.test_dataset = self.load_dataset('test/raw_test.pkl') 102 | 103 | return True 104 | 105 | def save_dataset(self, dataset, dataset_path): 106 | with dataset_path.open(mode='wb') as dataset_file: 107 | pickle.dump(dataset, dataset_file) 108 | 109 | def create_dirs(self): 110 | if not self.dataset_save_path.is_dir(): 111 | self.dataset_save_path.mkdir() 112 | 113 | train_dataset = self.dataset_save_path / 'train' 114 | if not train_dataset.is_dir(): 115 | train_dataset.mkdir() 116 | 117 | validation_dataset = self.dataset_save_path / 'validation' 118 | if not validation_dataset.is_dir(): 119 | validation_dataset.mkdir() 120 | 121 | test_dataset = self.dataset_save_path / 'test' 122 | if not test_dataset.is_dir(): 123 | test_dataset.mkdir() 124 | 125 | return train_dataset, validation_dataset, test_dataset 126 | 127 | def save_datasets(self): 128 | 129 | (train_dataset_path, validation_dataset_path, 130 | test_dataset_path) = self.create_dirs() 131 | 132 | save_path = self.dataset_save_path / 'all_songs.pkl' 133 | self.save_dataset(self.all_songs, save_path) 134 | 135 | train_dataset_path = train_dataset_path / 'raw_train.pkl' 136 | 
self.save_dataset(self.train_dataset, train_dataset_path) 137 | 138 | validation_dataset_path = validation_dataset_path / 'raw_validation.pkl' 139 | self.save_dataset(self.validation_dataset, validation_dataset_path) 140 | 141 | test_dataset_path = test_dataset_path / 'raw_test.pkl' 142 | self.save_dataset(self.test_dataset, test_dataset_path) 143 | 144 | def split_dataset(self, validation_percent, test_percent): 145 | total_size = validation_percent + test_percent 146 | 147 | random.shuffle(self.all_songs) 148 | train_size = len(self.all_songs) - int(len(self.all_songs) * total_size) 149 | self.train_dataset = self.all_songs[:train_size] 150 | 151 | validation_size = train_size + int(len(self.all_songs) * validation_percent) 152 | self.validation_dataset = self.all_songs[train_size: validation_size] 153 | 154 | self.test_dataset = self.all_songs[validation_size:] 155 | 156 | def create_dataset(self, validation_percent, test_percent): 157 | if not self.load_datasets(): 158 | print('Creating dataset ...') 159 | self.get_songs() 160 | self.split_dataset(validation_percent, test_percent) 161 | self.save_datasets() 162 | 163 | def display_info(self): 164 | print('Total number of songs: {}'.format(len(self.all_songs))) 165 | print('Average size of songs: {} words'.format(int(self.average_song_size()))) 166 | 167 | print('Train size: {}'.format(len(self.train_dataset))) 168 | print('Validation size: {}'.format(len(self.validation_dataset))) 169 | print('Test size: {}'.format(len(self.test_dataset))) 170 | -------------------------------------------------------------------------------- /preprocessing/text_preprocessing.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | from tensorflow.contrib import learn 4 | 5 | 6 | def get_vocabulary(text_array, min_frequency): 7 | def tokenizer_fn(iterator): 8 | return (x.split(' ') for x in iterator) 9 | 10 | max_size = max([len(review) for review in text_array]) 11 | text_array = [' '.join(text) for text in text_array] 12 | 13 | vocabulary_processor = learn.preprocessing.VocabularyProcessor( 14 | max_size, tokenizer_fn=tokenizer_fn, min_frequency=min_frequency) 15 | 16 | vocabulary_processor.fit(text_array) 17 | 18 | vocab = vocabulary_processor.vocabulary_._mapping 19 | sorted_vocab = sorted(vocab.items(), key=lambda x: x[1]) 20 | 21 | return sorted_vocab 22 | 23 | 24 | def create_word_dictionaties(vocab): 25 | word2index = {word: index + 1 for (word, index) in vocab} 26 | index2word = {index: word for word, index in word2index.items()} 27 | 28 | return word2index, index2word 29 | 30 | 31 | def save(data, data_path): 32 | with data_path.open('wb') as data_file: 33 | pickle.dump(data, data_file) 34 | 35 | 36 | def replace_unk_words(dataset, word2index): 37 | for text in dataset: 38 | for index_word, word in enumerate(text[:]): 39 | if word not in word2index: 40 | text[index_word] = '' 41 | 42 | 43 | def replace_words_with_ids(dataset, word2index): 44 | word_id_dataset = [] 45 | 46 | for data in dataset: 47 | word_id = [word2index[word] for word in data] 48 | word_id_dataset.append(word_id) 49 | 50 | return word_id_dataset 51 | 52 | 53 | def get_sizes_list(dataset): 54 | return [len(data) for data in dataset] 55 | 56 | 57 | def create_chunks(dataset, chunk_max_size): 58 | chunks_dataset = [] 59 | 60 | for song in dataset: 61 | chunks = [song[x:x + chunk_max_size] 62 | for x in range(0, len(song), chunk_max_size)] 63 | chunks_dataset.extend(chunks) 64 | 65 | return chunks_dataset 66 | 67 | 68 | def 
create_labels(dataset): 69 | new_dataset = [] 70 | labels = [] 71 | 72 | for data in dataset: 73 | new_dataset.append(data[0:-1]) 74 | labels.append(data[1:]) 75 | 76 | return new_dataset, labels 77 | -------------------------------------------------------------------------------- /preprocessing/tfrecord.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | class SentenceTFRecord(): 5 | def __init__(self, dataset, output_path): 6 | self.dataset = dataset 7 | self.output_path = output_path 8 | 9 | def parse_sentences(self): 10 | writer = tf.python_io.TFRecordWriter(self.output_path) 11 | 12 | all_data, all_labels, all_sizes = self.dataset 13 | 14 | for data, labels, size in zip(all_data, all_labels, all_sizes): 15 | example = self.make_example(data, labels, size) 16 | writer.write(example.SerializeToString()) 17 | 18 | writer.close() 19 | 20 | def make_example(self, data, labels, size): 21 | example = tf.train.SequenceExample() 22 | 23 | example.context.feature['size'].int64_list.value.append(size) 24 | 25 | sentence_tokens = example.feature_lists.feature_list['tokens'] 26 | labels_tokens = example.feature_lists.feature_list['labels'] 27 | 28 | for (token, label) in zip(data, labels): 29 | sentence_tokens.feature.add().int64_list.value.append(int(token)) 30 | labels_tokens.feature.add().int64_list.value.append(int(label)) 31 | 32 | return example 33 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow-gpu==1.4.1 2 | beautifulsoup4 3 | requests 4 | flask 5 | flask-cors 6 | nltk 7 | wordcloud 8 | flake8 9 | -------------------------------------------------------------------------------- /sample.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | from collections import defaultdict 5 | 6 | from model.rnn import RecurrentModel, RecurrentConfig 7 | from model.song_generator import GreedySongGenerator 8 | from utils.session_manager import initialize_session 9 | 10 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 11 | 12 | 13 | def create_argparse(): 14 | argument_parser = argparse.ArgumentParser() 15 | 16 | argument_parser.add_argument('-chp', 17 | '--checkpoint-path', 18 | type=str, 19 | help="The path to save model's checkpoint") 20 | 21 | argument_parser.add_argument('-i2w', 22 | '--index2word-path', 23 | type=str, 24 | help='Location of the index2word dict') 25 | 26 | argument_parser.add_argument('-w2i', 27 | '--word2index-path', 28 | type=str, 29 | help='Location of word2index dict') 30 | 31 | argument_parser.add_argument('-vs', 32 | '--vocab-size', 33 | type=int, 34 | help='Size of the vocabulary') 35 | 36 | argument_parser.add_argument('-es', 37 | '--embedding-size', 38 | type=int, 39 | help='Dimension of the embedding matrix') 40 | 41 | argument_parser.add_argument('-nl', 42 | '--num-layers', 43 | type=int, 44 | help='Number of lstm layers to use') 45 | 46 | argument_parser.add_argument('-nu', 47 | '--num-units', 48 | type=int, 49 | help='Number of units to use in the lstm cell') 50 | 51 | argument_parser.add_argument('-t', 52 | '--temperature', 53 | type=float, 54 | help='Logits temperature') 55 | 56 | return argument_parser 57 | 58 | 59 | def main(): 60 | argument_parser = create_argparse() 61 | user_args = vars(argument_parser.parse_args()) 62 | user_args['use_checkpoint'] = True 63 | prime_words = [] 64
| 65 | user_args = defaultdict(int, user_args) 66 | 67 | config = RecurrentConfig(user_args) 68 | model = RecurrentModel(None, config) 69 | model.build_placeholders() 70 | model.build_generate_graph(reuse=False) 71 | 72 | with initialize_session(config, use_gpu=False) as (sess, saver): 73 | generator = GreedySongGenerator(model) 74 | temperature = user_args['temperature'] 75 | 76 | print('Generating song (Greedy) ...') 77 | print(generator.generate(sess, prime_words=prime_words, temperature=temperature)) 78 | 79 | 80 | if __name__ == '__main__': 81 | main() 82 | -------------------------------------------------------------------------------- /scripts/create_sample.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: ./scripts/create_sample.sh 6 | 7 | ALL_INDEX2WORD_PATH='data/song_dataset/index2word.pkl' 8 | ALL_WORD2INDEX_PATH='data/song_dataset/word2index.pkl' 9 | ALL_CHECKPOINT_PATH='good_checkpoint' 10 | ALL_VOCAB_SIZE=12399 11 | 12 | KONDZILLA_INDEX2WORD_PATH='kondzilla/song_dataset/index2word.pkl' 13 | KONDZILLA_WORD2INDEX_PATH='kondzilla/song_dataset/word2index.pkl' 14 | KONDZILLA_CHECKPOINT_PATH='kondzilla_checkpoint' 15 | KONDZILLA_VOCAB_SIZE=2192 16 | 17 | PROIBIDAO_INDEX2WORD_PATH='proibidao-songs/song_dataset/index2word.pkl' 18 | PROIBIDAO_WORD2INDEX_PATH='proibidao-songs/song_dataset/word2index.pkl' 19 | PROIBIDAO_CHECKPOINT_PATH='proibidao_checkpoint' 20 | PROIBIDAO_VOCAB_SIZE=1445 21 | 22 | OSTENTACAO_INDEX2WORD_PATH='ostentacao-songs/song_dataset/index2word.pkl' 23 | OSTENTACAO_WORD2INDEX_PATH='ostentacao-songs/song_dataset/word2index.pkl' 24 | OSTENTACAO_CHECKPOINT_PATH='ostentacao_checkpoint' 25 | OSTENTACAO_VOCAB_SIZE=2035 26 | 27 | EMBEDDING_SIZE=300 28 | NUM_LAYERS=3 29 | NUM_UNITS=728 30 | TEMPERATURE=0.7 31 | 32 | PARAM=${1:-all} 33 | if [ $PARAM == "all" ]; then 34 | echo "Creating sample for all songs model" 35 | INDEX2WORD_PATH=$ALL_INDEX2WORD_PATH 36 | WORD2INDEX_PATH=$ALL_WORD2INDEX_PATH 37 | CHECKPOINT_PATH=$ALL_CHECKPOINT_PATH 38 | VOCAB_SIZE=$ALL_VOCAB_SIZE 39 | elif [ $PARAM == "kondzilla" ]; then 40 | echo "Creating sample for kondzilla songs model" 41 | INDEX2WORD_PATH=$KONDZILLA_INDEX2WORD_PATH 42 | WORD2INDEX_PATH=$KONDZILLA_WORD2INDEX_PATH 43 | CHECKPOINT_PATH=$KONDZILLA_CHECKPOINT_PATH 44 | VOCAB_SIZE=$KONDZILLA_VOCAB_SIZE 45 | elif [ $PARAM == "proibidao" ]; then 46 | echo "Creating sample for probidao songs model" 47 | INDEX2WORD_PATH=$PROIBIDAO_INDEX2WORD_PATH 48 | WORD2INDEX_PATH=$PROIBIDAO_WORD2INDEX_PATH 49 | CHECKPOINT_PATH=$PROIBIDAO_CHECKPOINT_PATH 50 | VOCAB_SIZE=$PROIBIDAO_VOCAB_SIZE 51 | elif [ $PARAM == "ostentacao" ]; then 52 | echo "Creating sample for ostentacao songs model" 53 | INDEX2WORD_PATH=$OSTENTACAO_INDEX2WORD_PATH 54 | WORD2INDEX_PATH=$OSTENTACAO_WORD2INDEX_PATH 55 | CHECKPOINT_PATH=$OSTENTACAO_CHECKPOINT_PATH 56 | VOCAB_SIZE=$OSTENTACAO_VOCAB_SIZE 57 | fi 58 | 59 | python -u sample.py \ 60 | --checkpoint-path=${CHECKPOINT_PATH} \ 61 | --index2word-path=${INDEX2WORD_PATH} \ 62 | --word2index-path=${WORD2INDEX_PATH} \ 63 | --vocab-size=${VOCAB_SIZE} \ 64 | --embedding-size=${EMBEDDING_SIZE} \ 65 | --num-layers=${NUM_LAYERS} \ 66 | --num-units=${NUM_UNITS} \ 67 | --temperature=${TEMPERATURE} 68 | -------------------------------------------------------------------------------- /scripts/create_wordcloud_graph.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: 
./scripts/create_word_cloud.sh 6 | 7 | ALL_SONGS_PATH='data/song_dataset/all_songs.pkl' 8 | ALL_GRAPH_NAME='all-word-cloud-graph.png' 9 | 10 | KONDZILLA_SONGS_PATH='kondzilla/song_dataset/all_songs.pkl' 11 | KONDZILLA_GRAPH_NAME='kondzilla-word-cloud-graph.png' 12 | 13 | PROIBIDAO_SONGS_PATH='proibidao-songs/song_dataset/all_songs.pkl' 14 | PROIBIDAO_GRAPH_NAME='proibidao-word-cloud-graph.png' 15 | 16 | OSTENTACAO_SONGS_PATH='ostentacao-songs/song_dataset/all_songs.pkl' 17 | OSTENTACAO_GRAPH_NAME='ostentacao-word-cloud-graph.png' 18 | 19 | 20 | PARAM=${1:-all} 21 | if [ $PARAM == "all" ]; then 22 | echo "Creating word cloud graph for all songs" 23 | SONGS_PATH=$ALL_SONGS_PATH 24 | GRAPH_NAME=$ALL_GRAPH_NAME 25 | elif [ $PARAM == "kondzilla" ]; then 26 | echo "Creating word cloud graph for kondzilla songs" 27 | SONGS_PATH=$KONDZILLA_SONGS_PATH 28 | GRAPH_NAME=$KONDZILLA_GRAPH_NAME 29 | elif [ $PARAM == "proibidao" ]; then 30 | echo "Creating word cloud graph for proibidao songs" 31 | SONGS_PATH=$PROIBIDAO_SONGS_PATH 32 | GRAPH_NAME=$PROIBIDAO_GRAPH_NAME 33 | elif [ $PARAM == "ostentacao" ]; then 34 | echo "Creating word cloud graph for ostentacao songs" 35 | SONGS_PATH=$OSTENTACAO_SONGS_PATH 36 | GRAPH_NAME=$OSTENTACAO_GRAPH_NAME 37 | fi 38 | 39 | python -u song_word_cloud.py \ 40 | --songs-path=${SONGS_PATH} \ 41 | --graph-name=${GRAPH_NAME} 42 | -------------------------------------------------------------------------------- /scripts/download_musics.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/download_musics.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data/' 8 | KEY_FILE_PATH=$DATA_FOLDER'config' 9 | CODE_FILES_NAME='song_codes.txt' 10 | 11 | python vagalume_downloader.py \ 12 | --key-file-path=${KEY_FILE_PATH} \ 13 | --data-folder=${DATA_FOLDER} \ 14 | --code-files-name=${CODE_FILES_NAME} 15 | -------------------------------------------------------------------------------- /scripts/run_format_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/run_format_data.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data' 8 | DATASET_SAVE_PATH='song_dataset' 9 | MIN_FREQUENCY=5 10 | VALIDATION_PERCENT=0.0 11 | TEST_PERCENT=0.0 12 | 13 | 14 | python format_data.py \ 15 | --data-folder=${DATA_FOLDER} \ 16 | --dataset-save-path=${DATASET_SAVE_PATH} \ 17 | --min-frequency=${MIN_FREQUENCY} \ 18 | --validation-percent=${VALIDATION_PERCENT} \ 19 | --test-percent=${TEST_PERCENT} 20 | -------------------------------------------------------------------------------- /scripts/run_model.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: ./scripts/run_model.sh 6 | 7 | ALL_TRAIN_FILE='data/song_dataset/train/train.tfrecord' 8 | ALL_VALIDATION_FILE='data/song_dataset/validation/validation.tfrecord' 9 | ALL_TEST_FILE='data/song_dataset/test/test.tfrecord' 10 | ALL_INDEX2WORD_PATH='data/song_dataset/index2word.pkl' 11 | ALL_WORD2INDEX_PATH='data/song_dataset/word2index.pkl' 12 | ALL_CHECKPOINT_PATH='checkpoint' 13 | ALL_VOCAB_SIZE=12551 14 | 15 | KONDZILLA_TRAIN_FILE='kondzilla/song_dataset/train/train.tfrecord' 16 | KONDZILLA_VALIDATION_FILE='kondzilla/song_dataset/validation/validation.tfrecord' 17 | KONDZILLA_TEST_FILE='kondzilla/song_dataset/test/test.tfrecord' 18 | KONDZILLA_INDEX2WORD_PATH='kondzilla/song_dataset/index2word.pkl' 19 | 
KONDZILLA_WORD2INDEX_PATH='kondzilla/song_dataset/word2index.pkl' 20 | KONDZILLA_CHECKPOINT_PATH='kondzilla_checkpoint' 21 | KONDZILLA_VOCAB_SIZE=2192 22 | 23 | PROIBIDAO_TRAIN_FILE='proibidao-songs/song_dataset/train/train.tfrecord' 24 | PROIBIDAO_VALIDATION_FILE='proibidao-songs/song_dataset/validation/validation.tfrecord' 25 | PROIBIDAO_TEST_FILE='proibidao-songs/song_dataset/test/test.tfrecord' 26 | PROIBIDAO_INDEX2WORD_PATH='proibidao-songs/song_dataset/index2word.pkl' 27 | PROIBIDAO_WORD2INDEX_PATH='proibidao-songs/song_dataset/word2index.pkl' 28 | PROIBIDAO_CHECKPOINT_PATH='proibidao_checkpoint' 29 | PROIBIDAO_VOCAB_SIZE=1445 30 | 31 | OSTENTACAO_TRAIN_FILE='ostentacao-songs/song_dataset/train/train.tfrecord' 32 | OSTENTACAO_VALIDATION_FILE='ostentacao-songs/song_dataset/validation/validation.tfrecord' 33 | OSTENTACAO_TEST_FILE='ostentacao-songs/song_dataset/test/test.tfrecord' 34 | OSTENTACAO_INDEX2WORD_PATH='ostentacao-songs/song_dataset/index2word.pkl' 35 | OSTENTACAO_WORD2INDEX_PATH='ostentacao-songs/song_dataset/word2index.pkl' 36 | OSTENTACAO_CHECKPOINT_PATH='ostentacao_checkpoint' 37 | OSTENTACAO_VOCAB_SIZE=2035 38 | 39 | USE_CHECKPOINT=1 40 | 41 | LEARNING_RATE=0.002 42 | NUM_EPOCHS=20 43 | BATCH_SIZE=32 44 | 45 | NUM_LAYERS=3 46 | NUM_UNITS=728 47 | EMBEDDING_SIZE=300 48 | MIN_VAL=-1 49 | MAX_VAL=1 50 | 51 | EMBEDDING_DROPOUT=0.5 52 | LSTM_OUTPUT_DROPOUT=0.5 53 | LSTM_STATE_DROPOUT=0.5 54 | LSTM_INPUT_DROPOUT=0.5 55 | WEIGHT_DECAY=0.0000 56 | 57 | NUM_BUCKETS=30 58 | BUCKET_WIDTH=30 59 | PREFETCH_BUFFER=8 60 | PERFORM_SHUFFLE=1 61 | 62 | PARAM=${1:-all} 63 | if [ $PARAM == "all" ]; then 64 | echo "Running model for all songs" 65 | TRAIN_FILE=$ALL_TRAIN_FILE 66 | VALIDATION_FILE=$ALL_VALIDATION_FILE 67 | TEST_FILE=$ALL_TEST_FILE 68 | INDEX2WORD_PATH=$ALL_INDEX2WORD_PATH 69 | WORD2INDEX_PATH=$ALL_WORD2INDEX_PATH 70 | CHECKPOINT_PATH=$ALL_CHECKPOINT_PATH 71 | VOCAB_SIZE=$ALL_VOCAB_SIZE 72 | elif [ $PARAM == "kondzilla" ]; then 73 | echo "Running model for kondzilla songs" 74 | TRAIN_FILE=$KONDZILLA_TRAIN_FILE 75 | VALIDATION_FILE=$KONDZILLA_VALIDATION_FILE 76 | TEST_FILE=$KONDZILLA_TEST_FILE 77 | INDEX2WORD_PATH=$KONDZILLA_INDEX2WORD_PATH 78 | WORD2INDEX_PATH=$KONDZILLA_WORD2INDEX_PATH 79 | CHECKPOINT_PATH=$KONDZILLA_CHECKPOINT_PATH 80 | VOCAB_SIZE=$KONDZILLA_VOCAB_SIZE 81 | elif [ $PARAM == "proibidao" ]; then 82 | echo "Running model for probidao songs" 83 | TRAIN_FILE=$PROIBIDAO_TRAIN_FILE 84 | VALIDATION_FILE=$PROIBIDAO_VALIDATION_FILE 85 | TEST_FILE=$PROIBIDAO_TEST_FILE 86 | INDEX2WORD_PATH=$PROIBIDAO_INDEX2WORD_PATH 87 | WORD2INDEX_PATH=$PROIBIDAO_WORD2INDEX_PATH 88 | CHECKPOINT_PATH=$PROIBIDAO_CHECKPOINT_PATH 89 | VOCAB_SIZE=$PROIBIDAO_VOCAB_SIZE 90 | elif [ $PARAM == "ostentacao" ]; then 91 | echo "Running model for ostentacao songs" 92 | TRAIN_FILE=$OSTENTACAO_TRAIN_FILE 93 | VALIDATION_FILE=$OSTENTACAO_VALIDATION_FILE 94 | TEST_FILE=$OSTENTACAO_TEST_FILE 95 | INDEX2WORD_PATH=$OSTENTACAO_INDEX2WORD_PATH 96 | WORD2INDEX_PATH=$OSTENTACAO_WORD2INDEX_PATH 97 | CHECKPOINT_PATH=$OSTENTACAO_CHECKPOINT_PATH 98 | VOCAB_SIZE=$OSTENTACAO_VOCAB_SIZE 99 | fi 100 | 101 | 102 | python -u model.py \ 103 | --train-file=${TRAIN_FILE} \ 104 | --validation-file=${VALIDATION_FILE} \ 105 | --test-file=${TEST_FILE} \ 106 | --checkpoint-path=${CHECKPOINT_PATH} \ 107 | --use-checkpoint=${USE_CHECKPOINT} \ 108 | --index2word-path=${INDEX2WORD_PATH} \ 109 | --word2index-path=${WORD2INDEX_PATH} \ 110 | --num-epochs=${NUM_EPOCHS} \ 111 | --batch-size=${BATCH_SIZE} \ 112 | 
--learning-rate=${LEARNING_RATE} \ 113 | --num-layers=${NUM_LAYERS} \ 114 | --num-units=${NUM_UNITS} \ 115 | --vocab-size=${VOCAB_SIZE} \ 116 | --embedding-size=${EMBEDDING_SIZE} \ 117 | --embedding-dropout=${EMBEDDING_DROPOUT} \ 118 | --lstm-output-dropout=${LSTM_OUTPUT_DROPOUT} \ 119 | --lstm-input-dropout=${LSTM_INPUT_DROPOUT} \ 120 | --lstm-state-dropout=${LSTM_STATE_DROPOUT} \ 121 | --weight-decay=${WEIGHT_DECAY} \ 122 | --min-val=${MIN_VAL} \ 123 | --max-val=${MAX_VAL} \ 124 | --num-buckets=${NUM_BUCKETS} \ 125 | --bucket-width=${BUCKET_WIDTH} \ 126 | --prefetch-buffer=${PREFETCH_BUFFER} \ 127 | --perform-shuffle=${PERFORM_SHUFFLE} 128 | -------------------------------------------------------------------------------- /scripts/run_vagalume_crawler.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/run_vagalume_crawler.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data/' 8 | ARTIST_LIST_PATH=$DATA_FOLDER'artist_list.txt' 9 | 10 | 11 | python vagalume_crawler.py \ 12 | --data-folder=${DATA_FOLDER} \ 13 | --artist-list-path=${ARTIST_LIST_PATH} 14 | -------------------------------------------------------------------------------- /song_word_cloud.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import nltk 3 | import pickle 4 | 5 | from PIL import Image 6 | import numpy as np 7 | 8 | from wordcloud import WordCloud 9 | 10 | 11 | def create_argparse(): 12 | argument_parser = argparse.ArgumentParser() 13 | 14 | argument_parser.add_argument('-sp', 15 | '--songs-path', 16 | type=str, 17 | help='Location of the all songs pickle file') 18 | 19 | argument_parser.add_argument('-gn', 20 | '--graph-name', 21 | type=str, 22 | help='Name of the word cloud graph') 23 | 24 | return argument_parser 25 | 26 | 27 | def create_songs_str(songs_path): 28 | stopwords = nltk.corpus.stopwords.words('portuguese') 29 | more_words = ['pra', 'tá', 'pode', 'tô', 'hoje', 'Então', 'então', 'agora', 30 | 'tudo', 'porque', 'sempre', 'quero', 'quer', 'sei', 'Refrão', 31 | '2x', 'assim', 'aqui', 'todo', 'vai', 'vem', 'nóis', 'vou', 32 | 'pro', 'ser', 'nois', 'ter', 'tao', 'la', 'tão', 'ta'] 33 | stopwords.extend(more_words) 34 | 35 | with open(songs_path, 'rb') as f: 36 | all_songs = pickle.load(f) 37 | 38 | songs = [' '.join(s[1:-1]) for s in all_songs] 39 | all_songs_str = '\n'.join(songs) 40 | stopwords = set(stopwords) 41 | 42 | return all_songs_str, stopwords 43 | 44 | 45 | def create_word_cloud_graph(all_songs_str, stopwords, graph_name): 46 | sarrada_mask = np.array(Image.open("masks/romano.png")) 47 | wc = WordCloud(background_color="white", max_words=2000, mask=sarrada_mask, 48 | stopwords=stopwords) 49 | wc.generate(all_songs_str) 50 | wc.to_file(graph_name) 51 | 52 | 53 | def main(): 54 | argument_parser = create_argparse() 55 | user_args = vars(argument_parser.parse_args()) 56 | songs_path = user_args['songs_path'] 57 | graph_name = user_args['graph_name'] 58 | 59 | all_songs_str, stopwords = create_songs_str(songs_path) 60 | create_word_cloud_graph(all_songs_str, stopwords, graph_name) 61 | 62 | 63 | if __name__ == '__main__': 64 | main() 65 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/utils/__init__.py 
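The word-cloud helpers in song_word_cloud.py above can also be called directly from Python instead of through scripts/create_wordcloud_graph.sh. The sketch below is only an illustration: the pickle path and output name mirror the defaults from that shell script, and it assumes the NLTK Portuguese stopwords corpus and masks/romano.png are available.

```python
# Sketch only: reuse song_word_cloud.py's helpers directly (paths mirror the shell script).
from song_word_cloud import create_songs_str, create_word_cloud_graph

# Requires the NLTK stopwords corpus, e.g. nltk.download('stopwords') beforehand.
all_songs_str, stopwords = create_songs_str('data/song_dataset/all_songs.pkl')
create_word_cloud_graph(all_songs_str, stopwords, 'all-word-cloud-graph.png')
```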
-------------------------------------------------------------------------------- /utils/progress_bar.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | 4 | import numpy as np 5 | 6 | 7 | class Progbar(): 8 | """ 9 | Progbar class copied from keras (https://github.com/fchollet/keras/) 10 | Displays a progress bar. 11 | # Arguments 12 | target: Total number of steps expected. 13 | interval: Minimum visual progress update interval (in seconds). 14 | """ 15 | 16 | def __init__(self, target, width=30, verbose=1): 17 | self.width = width 18 | self.target = target 19 | self.sum_values = {} 20 | self.unique_values = [] 21 | self.start = time.time() 22 | self.total_width = 0 23 | self.seen_so_far = 0 24 | self.verbose = verbose 25 | 26 | def update(self, current, values=None, exact=None): 27 | """ 28 | Updates the progress bar. 29 | # Arguments 30 | current: Index of current step. 31 | values: List of tuples (name, value_for_last_step). 32 | The progress bar will display averages for these values. 33 | exact: List of tuples (name, value_for_last_step). 34 | The progress bar will display these values directly. 35 | """ 36 | values = values or [] 37 | exact = exact or [] 38 | 39 | for k, v in values: 40 | if k not in self.sum_values: 41 | self.sum_values[k] = [v * (current - self.seen_so_far), current - self.seen_so_far] 42 | self.unique_values.append(k) 43 | else: 44 | self.sum_values[k][0] += v * (current - self.seen_so_far) 45 | self.sum_values[k][1] += (current - self.seen_so_far) 46 | for k, v in exact: 47 | if k not in self.sum_values: 48 | self.unique_values.append(k) 49 | self.sum_values[k] = [v, 1] 50 | self.seen_so_far = current 51 | 52 | now = time.time() 53 | if self.verbose == 1: 54 | prev_total_width = self.total_width 55 | sys.stdout.write("\b" * prev_total_width) 56 | sys.stdout.write("\r") 57 | 58 | numdigits = int(np.floor(np.log10(self.target))) + 1 59 | barstr = '%%%dd/%%%dd [' % (numdigits, numdigits) 60 | bar = barstr % (current, self.target) 61 | prog = float(current)/self.target 62 | prog_width = int(self.width*prog) 63 | if prog_width > 0: 64 | bar += ('='*(prog_width-1)) 65 | if current < self.target: 66 | bar += '>' 67 | else: 68 | bar += '=' 69 | bar += ('.'*(self.width-prog_width)) 70 | bar += ']' 71 | sys.stdout.write(bar) 72 | self.total_width = len(bar) 73 | 74 | if current: 75 | time_per_unit = (now - self.start) / current 76 | else: 77 | time_per_unit = 0 78 | eta = time_per_unit*(self.target - current) 79 | info = '' 80 | if current < self.target: 81 | info += ' - ETA: %ds' % eta 82 | else: 83 | info += ' - %ds' % (now - self.start) 84 | for k in self.unique_values: 85 | if isinstance(self.sum_values[k], list): 86 | info += ' - %s: %.4f' % ( 87 | k, self.sum_values[k][0] / max(1, self.sum_values[k][1])) 88 | else: 89 | info += ' - %s: %s' % (k, self.sum_values[k]) 90 | 91 | self.total_width += len(info) 92 | if prev_total_width > self.total_width: 93 | info += ((prev_total_width-self.total_width) * " ") 94 | 95 | sys.stdout.write(info) 96 | sys.stdout.flush() 97 | 98 | if current >= self.target: 99 | sys.stdout.write("\n") 100 | 101 | if self.verbose == 2: 102 | if current >= self.target: 103 | info = '%ds' % (now - self.start) 104 | for k in self.unique_values: 105 | info += ' - %s: %.4f' % ( 106 | k, self.sum_values[k][0] / max(1, self.sum_values[k][1])) 107 | sys.stdout.write(info + "\n") 108 | 109 | def add(self, n, values=None): 110 | self.update(self.seen_so_far+n, values) 111 | 
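To make the intended use of the Progbar class above concrete, here is a small, hypothetical driver loop. The step count and the reported 'loss' value are placeholders chosen for the example, not values taken from the project.

```python
# Sketch only: driving utils/progress_bar.Progbar from a training-style loop.
from utils.progress_bar import Progbar

num_batches = 50
progbar = Progbar(target=num_batches)

for step in range(1, num_batches + 1):
    batch_loss = 1.0 / step  # placeholder metric for the example
    # Tuples passed via `values` are averaged over steps; use `exact` to display raw values.
    progbar.update(step, values=[('loss', batch_loss)])
```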
-------------------------------------------------------------------------------- /utils/session_manager.py: -------------------------------------------------------------------------------- 1 | import contextlib 2 | import os 3 | 4 | import tensorflow as tf 5 | 6 | 7 | @contextlib.contextmanager 8 | def initialize_session(user_config, use_gpu=True): 9 | if use_gpu: 10 | config = tf.ConfigProto() 11 | config.gpu_options.allow_growth = True 12 | else: 13 | config = tf.ConfigProto(device_count={'GPU': 0}) 14 | 15 | checkpoint = tf.train.latest_checkpoint(user_config.checkpoint_path) 16 | saver = tf.train.Saver() 17 | 18 | with tf.Session(config=config) as sess: 19 | if user_config.use_checkpoint: 20 | print('Load checkpoint: {}'.format(checkpoint)) 21 | saver.restore(sess, checkpoint) 22 | else: 23 | print('Creating new model') 24 | if not os.path.exists(user_config.checkpoint_path): 25 | os.makedirs(user_config.checkpoint_path) 26 | 27 | sess.run(tf.global_variables_initializer()) 28 | 29 | yield (sess, saver) 30 | -------------------------------------------------------------------------------- /vagalume_crawler.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from crawler.music_crawler import MusicCrawler 4 | 5 | 6 | def create_argparse(): 7 | parser = argparse.ArgumentParser() 8 | 9 | parser.add_argument('-d', 10 | '--data-folder', 11 | type=str, 12 | help='Data folder path') 13 | 14 | parser.add_argument('-al', 15 | '--artist-list-path', 16 | type=str, 17 | help='Path of the file containing the artists') 18 | 19 | return parser 20 | 21 | 22 | def main(): 23 | parser = create_argparse() 24 | user_args = vars(parser.parse_args()) 25 | 26 | artist_list_path = user_args['artist_list_path'] 27 | data_folder = user_args['data_folder'] 28 | 29 | music_crawler = MusicCrawler(artist_list_path, data_folder) 30 | music_crawler.crawl_musics() 31 | 32 | 33 | if __name__ == '__main__': 34 | main() 35 | -------------------------------------------------------------------------------- /vagalume_downloader.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from crawler.music_crawler import MusicDownloader 4 | 5 | 6 | def create_argparse(): 7 | parser = argparse.ArgumentParser() 8 | 9 | parser.add_argument('-kfp', 10 | '--key-file-path', 11 | type=str, 12 | help='Location of the file containing the Vagalume API key') 13 | 14 | parser.add_argument('-df', 15 | '--data-folder', 16 | type=str, 17 | help='Location of the data files') 18 | 19 | parser.add_argument('-cfn', 20 | '--code-files-name', 21 | type=str, 22 | help='Name of the file that contains the music ids') 23 | 24 | return parser 25 | 26 | 27 | def main(): 28 | parser = create_argparse() 29 | user_args = vars(parser.parse_args()) 30 | 31 | key_file_path = user_args['key_file_path'] 32 | data_folder = user_args['data_folder'] 33 | code_files_name = user_args['code_files_name'] 34 | 35 | music_downloader = MusicDownloader(key_file_path, data_folder, code_files_name) 36 | music_downloader.download_all_songs() 37 | 38 | 39 | if __name__ == '__main__': 40 | main() 41 | --------------------------------------------------------------------------------
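As a closing illustration, the two Vagalume entry points (vagalume_crawler.py and vagalume_downloader.py) could also be chained from a single Python script instead of the shell wrappers. This is only a sketch built on the interfaces shown above; the paths mirror the defaults in scripts/run_vagalume_crawler.sh and scripts/download_musics.sh, and a valid Vagalume API key is assumed to exist in data/config.

```python
# Sketch only: run the crawl and download steps programmatically.
from crawler.music_crawler import MusicCrawler, MusicDownloader

data_folder = 'data/'

# 1. Collect song codes and names for every artist listed in data/artist_list.txt.
crawler = MusicCrawler(data_folder + 'artist_list.txt', data_folder)
crawler.crawl_musics()

# 2. Download the lyrics referenced by each artist's song_codes.txt,
#    using the Vagalume API key stored in data/config.
downloader = MusicDownloader(data_folder + 'config', data_folder, 'song_codes.txt')
downloader.download_all_songs()
```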